SYMMETRIC WASSERSTEIN AUTOENCODERS

Abstract

Leveraging the framework of Optimal Transport, we introduce a new family of generative autoencoders with a learnable prior, called Symmetric Wasserstein Autoencoders (SWAEs). We propose to symmetrically match the joint distributions of the observed data and the latent representation induced by the encoder and the decoder. The resulting algorithm jointly optimizes the modelling losses in both the data and the latent spaces, with the loss in the data space leading to a denoising effect. Owing to the symmetric treatment of the data and the latent representation, the algorithm implicitly preserves the local structure of the data in the latent space. To further improve the latent representation, we incorporate a reconstruction loss into the objective, which significantly benefits both generation and reconstruction. We empirically show the superior performance of SWAEs over state-of-the-art generative autoencoders in terms of classification, reconstruction, and generation.

1. INTRODUCTION

Deep generative models have emerged as powerful frameworks for modelling complex data. Widely used families of such models include Generative Adversarial Networks (GANs) (Goodfellow et al., 2014), Variational Autoencoders (VAEs) (Rezende et al., 2014; Kingma & Welling, 2014), and autoregressive models (Uria et al., 2013; Van Oord et al., 2016). The VAE-based framework has been popular as it yields a bidirectional mapping, i.e., it consists of both an inference model (from data to latent space) and a generative model (from latent to data space). With an inference mechanism, VAEs can provide a useful latent representation that captures salient information about the observed data. Such a latent representation can in turn benefit downstream tasks such as clustering, classification, and data generation. In particular, VAE-based approaches have achieved impressive performance on challenging real-world applications, including image synthesis (Razavi et al., 2019), natural text generation (Hu et al., 2017), and neural machine translation (Sutskever et al., 2014). VAEs aim to maximize a tractable variational lower bound on the log-likelihood of the observed data, commonly called the ELBO. Since VAEs focus on modelling the marginal likelihood of the data instead of the joint likelihood of the data and the latent representation, the quality of the latent representation is not well assessed (Alemi et al., 2017; Zhao et al., 2019), which is undesirable for learning useful representations. Beyond the perspective of maximum-likelihood learning, the objective of VAEs is equivalent to minimizing the KL divergence between the encoding and the decoding distributions, where the former is the joint distribution of the observed data and the latent representation induced by the encoder, and the latter is the corresponding joint distribution induced by the decoder.
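This equivalence can be made explicit with standard notation (a sketch, not taken from this paper): write the encoding joint as $q(x,z) = p_{\mathcal{D}}(x)\, q_\phi(z \mid x)$ and the decoding joint as $p(x,z) = p(z)\, p_\theta(x \mid z)$. Expanding the KL divergence between the two joints gives

```latex
\mathrm{KL}\big(q(x,z) \,\|\, p(x,z)\big)
  = \mathbb{E}_{q(x,z)}\!\left[\log \frac{p_{\mathcal{D}}(x)\, q_\phi(z \mid x)}{p(z)\, p_\theta(x \mid z)}\right]
  = -H(p_{\mathcal{D}}) - \mathbb{E}_{p_{\mathcal{D}}(x)}\big[\mathrm{ELBO}(x)\big],
\quad \text{where} \quad
\mathrm{ELBO}(x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z) + \log p(z) - \log q_\phi(z \mid x)\big].
```

Since the data entropy $H(p_{\mathcal{D}})$ is a constant of the optimization, maximizing the expected ELBO is equivalent to minimizing the KL divergence from the encoding joint to the decoding joint.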
This connection has been revealed in several recent works (Livne et al., 2019; Esmaeili et al., 2019; Pu et al., 2017b; Chen et al., 2018). Due to the asymmetry of the KL divergence, generated samples are likely to have low probability under the data distribution, which often leads to unrealistic samples (Li et al., 2017b; Alemi et al., 2017). Much work has been devoted to improving VAEs from different perspectives. For example, to enhance the expressive power of the latent representation, VampPrior (Tomczak & Welling, 2018), normalizing flows (Rezende & Mohamed, 2015), and Stein VAEs (Pu et al., 2017a) replace the Gaussian distribution imposed on the latent variables with a more sophisticated and flexible distribution. However, these methods are all built on the objective of VAEs and are therefore unable to alleviate the limitations induced by that objective. To improve the latent representation, Zhao et al. (2019) explicitly include the mutual information between the data and the latent representation in the objective. Moreover, to address the asymmetry of the KL divergence in VAEs, several works (Livne et al., 2019; Chen et al., 2018; Pu et al., 2017b) leverage a symmetric divergence measure between the encoding and the decoding distributions.
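As a generic illustration of this idea (the cited works differ in their specific constructions), one simple symmetric alternative penalizes both directions of the KL divergence between the encoding and decoding joints:

```latex
D_{\mathrm{sym}}\big(q, p\big)
  = \mathrm{KL}\big(q(x,z)\,\|\,p(x,z)\big) + \mathrm{KL}\big(p(x,z)\,\|\,q(x,z)\big).
```

The added reverse term penalizes decoder samples that fall in low-density regions of the data distribution, directly targeting the failure mode of the asymmetric objective noted above.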

