SYMMETRIC WASSERSTEIN AUTOENCODERS

Abstract

Leveraging the framework of Optimal Transport, we introduce a new family of generative autoencoders with a learnable prior, called Symmetric Wasserstein Autoencoders (SWAEs). We propose to symmetrically match the joint distributions of the observed data and the latent representation induced by the encoder and the decoder. The resulting algorithm jointly optimizes the modelling losses in both the data and the latent spaces, with the loss in the data space yielding a denoising effect. With the symmetric treatment of the data and the latent representation, the algorithm implicitly preserves the local structure of the data in the latent space. To further improve the latent representation, we incorporate a reconstruction loss into the objective, which significantly benefits both generation and reconstruction. We empirically show the superior performance of SWAEs over state-of-the-art generative autoencoders in terms of classification, reconstruction, and generation.

1. INTRODUCTION

Deep generative models have emerged as powerful frameworks for modelling complex data. Widely used families of such models include Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), Variational Autoencoders (VAEs) (Rezende et al., 2014; Kingma & Welling, 2014), and autoregressive models (Uria et al., 2013; Van Oord et al., 2016). The VAE-based framework has been popular as it yields a bidirectional mapping, i.e., it consists of both an inference model (from the data space to the latent space) and a generative model (from the latent space to the data space). With an inference mechanism, VAEs can provide a useful latent representation that captures salient information about the observed data. Such a representation can in turn benefit downstream tasks such as clustering, classification, and data generation. In particular, VAE-based approaches have achieved impressive results on challenging real-world applications, including image synthesis (Razavi et al., 2019), natural text generation (Hu et al., 2017), and neural machine translation (Sutskever et al., 2014).

VAEs aim to maximize a tractable variational lower bound on the log-likelihood of the observed data, commonly called the ELBO. Since VAEs focus on modelling the marginal likelihood of the data instead of the joint likelihood of the data and the latent representation, the quality of the latent representation is not well assessed (Alemi et al., 2017; Zhao et al., 2019), which is undesirable for learning a useful representation. Beyond the maximum-likelihood view, the objective of VAEs is equivalent to minimizing the KL divergence between the encoding and the decoding distributions, where the former is the joint distribution of the observed data and the latent representation induced by the encoder, and the latter is the corresponding joint distribution induced by the decoder.
Such a connection has been revealed in several recent works (Livne et al., 2019; Esmaeili et al., 2019; Pu et al., 2017b; Chen et al., 2018). Due to the asymmetry of the KL divergence, the generated samples are likely to have low probability under the data distribution, which often leads to unrealistic samples (Li et al., 2017b; Alemi et al., 2017).

Many works have proposed to improve VAEs from different perspectives. For example, to enhance the expressive power of the latent distribution, VampPrior (Tomczak & Welling, 2018), normalizing flows (Rezende & Mohamed, 2015), and Stein VAEs (Pu et al., 2017a) replace the Gaussian distribution imposed on the latent variables with a more sophisticated and flexible one. However, these methods all retain the objective of VAEs and therefore cannot alleviate the limitations it induces. To improve the latent representation, Zhao et al. (2019) explicitly include the mutual information between the data and the latent representation in the objective. Moreover, to address the asymmetry of the KL divergence in VAEs, Livne et al. (2019), Chen et al. (2018), and Pu et al. (2017b) leverage a symmetric divergence measure between the encoding and the decoding distributions.

In this paper, we leverage Optimal Transport (OT) (Villani, 2008; Peyré et al., 2019) to symmetrically match the encoding and the decoding distributions. The OT optimization is generally challenging, particularly in high dimension; we address this difficulty by transforming the OT cost into a simpler form amenable to efficient numerical implementation. Owing to the symmetric treatment of the observed data and the latent representation, the local structure of the data can be implicitly preserved in the latent space. However, we found that with the symmetric treatment alone, the performance of the generative model may not be satisfactory.
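The asymmetry of the KL divergence noted above is easy to verify numerically. The following minimal sketch uses two hand-picked discrete distributions; all numbers are purely illustrative:

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence KL(p || q) = sum_i p_i log(p_i / q_i)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Two hand-picked discrete distributions (numbers are illustrative).
p = np.array([0.8, 0.15, 0.05])
q = np.array([1 / 3, 1 / 3, 1 / 3])

kl_pq = kl(p, q)  # penalizes mass of p where q is small
kl_qp = kl(q, p)  # penalizes mass of q where p is small
print(kl_pq, kl_qp)  # the two directions differ
```

Which direction is minimized thus changes which mismatches are penalized, motivating the symmetric divergence measures discussed above.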
To improve the generative model, we additionally include a reconstruction loss in the objective, which is shown to significantly benefit the quality of both generation and reconstruction. Our contributions can be summarized as follows. Firstly, we propose a new family of generative autoencoders, called Symmetric Wasserstein Autoencoders (SWAEs). Secondly, we adopt a learnable latent prior, parameterized as a mixture of the conditional priors given learnable pseudo-inputs, which prevents SWAEs from over-regularizing the latent variables. Thirdly, we empirically perform an ablation study of SWAEs in terms of k-NN classification, denoising, reconstruction, and sample generation. Finally, we empirically verify, on benchmark tasks, the superior performance of SWAEs over several state-of-the-art generative autoencoders.
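The learnable mixture prior mentioned in the second contribution can be sketched as follows. This is a VampPrior-style construction, not the exact parameterization used in SWAEs; the linear encoder map, unit variances, and all dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: K pseudo-inputs u_k in the data space; the prior is
# the mixture p(z) = (1/K) sum_k N(z; mu(u_k), diag(var(u_k))), with mu
# given by the encoder (a linear map here for brevity).
K, x_dim, z_dim = 5, 4, 2
pseudo_inputs = rng.normal(size=(K, x_dim))  # learnable in practice
W_mu = rng.normal(size=(z_dim, x_dim))
mu_k = pseudo_inputs @ W_mu.T                # component means, shape (K, z_dim)
log_var_k = np.zeros((K, z_dim))             # unit variances for brevity

def log_prior(z):
    """Log-density of the mixture prior at z (log-sum-exp over components)."""
    var = np.exp(log_var_k)
    log_comp = -0.5 * np.sum(
        (z - mu_k) ** 2 / var + log_var_k + np.log(2 * np.pi), axis=1
    )
    m = log_comp.max()  # stabilize the log-sum-exp
    return m + np.log(np.mean(np.exp(log_comp - m)))

z = rng.normal(size=z_dim)
print(log_prior(z))
```

Because the pseudo-inputs (and hence the mixture components) are trained jointly with the model, the prior can adapt to the aggregate posterior instead of forcing the latent variables toward a fixed Gaussian.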

2. SYMMETRIC WASSERSTEIN AUTOENCODERS

In this section we introduce a new family of generative autoencoders, called Symmetric Wasserstein Autoencoders (SWAEs).

2.1. OT FORMULATION

Denote the random vector at the encoder as e ≜ (x_e, z_e) ∈ X × Z, which contains both the observed data x_e ∈ X and the latent representation z_e ∈ Z. We call p(e) = p(x_e)p(z_e|x_e) the encoding distribution, where p(x_e) represents the data distribution and p(z_e|x_e) characterizes an inference model. Similarly, denote the random vector at the decoder as d ≜ (x_d, z_d) ∈ X × Z, which consists of both the latent prior z_d ∈ Z and the generated data x_d ∈ X. We call p(d) = p(z_d)p(x_d|z_d) the decoding distribution, where p(z_d) represents the prior distribution and p(x_d|z_d) characterizes a generative model.

The objective of VAEs is equivalent to minimizing the (asymmetric) KL divergence between the encoding distribution p(e) and the decoding distribution p(d) (see Appendix A.1). To address this limitation of VAEs, we first propose to treat the data and the latent representation symmetrically rather than asymmetrically, by minimizing the p-th Wasserstein distance between p(e) and p(d), leveraging Optimal Transport (OT) (Villani, 2008; Peyré et al., 2019). OT compares two distributions in a Lagrangian framework, seeking the minimum cost of transporting one distribution to the other. We focus on the primal problem of OT, whose Kantorovich formulation (Peyré et al., 2019) is given in equation 1. In particular, it can be proved that the p-th Wasserstein distance is a metric, hence symmetric, and metrizes weak convergence (see, e.g., (Santambrogio, 2015)). Optimizing equation 1 is computationally prohibitive, especially in high dimension (Peyré et al., 2019). To provide an efficient solution, we restrict ourselves to a deterministic encoder and decoder. Specifically, at the encoder we have the latent representation z_e = E(x_e) with the function E : X → Z, and at the decoder we have the generated data x_d = D(z_d) with the function D : Z → X.
It turns out that, under this deterministic condition, instead of searching for an optimal coupling in high dimension, we can find a proper conditional distribution p(z_d|x_e) whose marginal is p(z_d).
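To make the objects above concrete, the sketch below forms the joint-space points e = (x_e, E(x_e)) and d = (D(z_d), z_d) and evaluates a sample-based transport cost under one candidate pairing. The linear encoder and decoder, the batch size, and the identity pairing (standing in for the learned conditional p(z_d|x_e)) are all illustrative assumptions, not the authors' algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical deterministic encoder/decoder (linear maps for illustration).
W_enc = rng.normal(size=(2, 4))  # E : X (R^4) -> Z (R^2)
W_dec = rng.normal(size=(4, 2))  # D : Z (R^2) -> X (R^4)
E = lambda x: x @ W_enc.T
D = lambda z: z @ W_dec.T

x_e = rng.normal(size=(8, 4))  # observed data batch
z_d = rng.normal(size=(8, 2))  # draws from the latent prior

# Joint-space points e = (x_e, E(x_e)) and d = (D(z_d), z_d).
e = np.concatenate([x_e, E(x_e)], axis=1)
d = np.concatenate([D(z_d), z_d], axis=1)

# Squared-Euclidean transport cost under one candidate pairing of e with d;
# the identity pairing here stands in for the learned conditional p(z_d|x_e).
joint_cost = np.mean(np.sum((e - d) ** 2, axis=1))
print(joint_cost)
```

Note that the cost decomposes into a data-space term ||x_e - D(z_d)||^2 and a latent-space term ||E(x_e) - z_d||^2, which is how the symmetric objective couples the modelling losses in both spaces.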

W_c(p(e), p(d)) ≜ inf_{Γ ∈ P(e∼p(e), d∼p(d))} E_{(e,d)∼Γ} [c(e, d)],    (1)

where P(e ∼ p(e), d ∼ p(d)), called the coupling between e and d, denotes the set of joint distributions of e and d with marginals p(e) and p(d), respectively, and c(e, d) : (X × Z) × (X × Z) → [0, +∞] denotes the cost function. When ((X × Z) × (X × Z), d) is a metric space and the cost function is c(e, d) = d^p(e, d) for p ≥ 1, the p-th root of W_c, denoted W_p, is defined as the p-th Wasserstein distance.
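The symmetry of the Kantorovich cost in equation 1 can be checked numerically on small empirical distributions. The sketch below uses entropic regularization (Sinkhorn iterations) as a tractable approximation of the infimum over couplings; the point clouds, regularization strength, and iteration count are illustrative assumptions:

```python
import numpy as np

def sinkhorn_cost(a_pts, b_pts, eps=1.0, iters=2000):
    """Entropic approximation of the Kantorovich cost between two uniform
    empirical distributions, with squared-Euclidean ground cost."""
    C = np.sum((a_pts[:, None, :] - b_pts[None, :, :]) ** 2, axis=2)
    K = np.exp(-C / eps)
    a = np.full(len(a_pts), 1.0 / len(a_pts))  # uniform marginal on a_pts
    b = np.full(len(b_pts), 1.0 / len(b_pts))  # uniform marginal on b_pts
    u = np.ones_like(a)
    for _ in range(iters):  # alternating marginal-matching updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]  # approximate optimal coupling Γ
    return float(np.sum(P * C))     # transport cost E_Γ[c]

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
Y = rng.normal(loc=1.0, size=(20, 2))
# Unlike the KL divergence, the cost is symmetric in its arguments.
print(sinkhorn_cost(X, Y), sinkhorn_cost(Y, X))
```

In practice, dedicated libraries (e.g., POT) implement stabilized versions of these iterations; the point here is only that swapping the two distributions leaves the cost unchanged.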

