ENCODED PRIOR SLICED WASSERSTEIN AUTOENCODER FOR LEARNING LATENT MANIFOLD REPRESENTATIONS

Abstract

While variational autoencoders have been successful in a variety of tasks, the use of conventional Gaussian or Gaussian mixture priors is limited in its ability to encode the underlying structure of data in the latent representation. In this work, we introduce the Encoded Prior Sliced Wasserstein AutoEncoder (EPSWAE), wherein an additional prior-encoder network learns an embedding of the data manifold which preserves topological and geometric properties of the data, thus improving the structure of the latent space. The autoencoder and prior-encoder networks are iteratively trained using the Sliced Wasserstein (SW) distance, which efficiently measures the distance between two arbitrary sampleable distributions without being constrained to a specific form, as the KL divergence is, and without requiring expensive adversarial training. To improve the representation, we use (1) a structural consistency term in the loss that encourages isometry between feature space and latent space and (2) a nonlinear variant of the SW distance which averages over random nonlinear shearings. The effectiveness of the learned manifold encoding is best explored by traversing the latent space through interpolations along geodesics, which generate samples that lie on the manifold and hence are advantageous compared to standard Euclidean interpolation. To this end, we introduce a graph-based algorithm for interpolating along network-geodesics in latent space by maximizing the density of samples along the path while minimizing total energy. We use 3D-spiral data to show that the prior does indeed encode the geometry underlying the data and to demonstrate the advantages of the network algorithm for interpolation. Additionally, we apply our framework to the MNIST and CelebA datasets, and show that outlier generations, latent representations, and geodesic interpolations are comparable to the state of the art.

1. INTRODUCTION

Generative models have the potential to capture rich representations of data and use them to generate realistic outputs. In particular, Variational AutoEncoders (VAEs) (Kingma & Welling, 2014) can capture important properties of high-dimensional data in their latent embeddings, and sample from a prior distribution to generate realistic images. While VAEs have been very successful in a variety of tasks, the use of a simplistic standard normal prior is known to cause problems such as under-fitting and over-regularization, and fails to use the network's entire modeling capacity (Burda et al., 2016). Gaussian or Gaussian mixture model (GMM) priors are also limited in their ability to represent geometric and topological properties of the underlying data manifold. High-dimensional data can typically be modeled as lying on or near an embedded low-dimensional, nonlinear manifold (Fefferman et al., 2016). Learning improved latent representations of this nonlinear manifold is an important problem, for which a more flexible prior may be desirable. Conventional variational inference uses the Kullback-Leibler (KL) divergence as a measure of distance between the posterior and the prior, restricting the prior distribution to cases that have tractable approximations of the KL divergence. Many works, such as Guo et al. (2020); Tomczak & Welling (2018); Rezende & Mohamed (2015), have investigated the use of more complicated priors (notably GMMs) which lead to improved latent representation and generation compared to a single Gaussian prior. Alternate approaches such as adversarial training learn arbitrary priors by using a discriminator network to compute a divergence (Wang et al., 2020); however, these are reported to be harder to train and are computationally expensive.
In this work, we introduce the Encoded Prior Sliced Wasserstein AutoEncoder (EPSWAE), which consists of a conventional autoencoder architecture and an additional prior-encoder network that learns an unconstrained prior distribution encoding the geometry and topology of any data manifold. We use a type of Sliced Wasserstein (SW) distance (Bonnotte, 2013; Bonneel et al., 2015), a concept from optimal transport theory that is a simple and convenient alternative to the KL divergence for any sampleable distributions. A Sliced Wasserstein AutoEncoder (SWAE) that regularizes an autoencoder using the SW distance was proposed in Kolouri et al. (2018a). Several works improve the SW distance through additional optimizations (Deshpande et al., 2019; Chen et al., 2020b; Deshpande et al., 2018) and show improved generation, but involve additional training and use a fixed (usually Gaussian) prior. Kolouri et al. (2019) presents a comparison between the max-SW distance, polynomial generalized SW distances, and their combinations. In contrast, we use a simple and efficient nonlinear shearing which requires no additional optimization. Additionally, we introduce a structural consistency term that encourages the latent space to be isometric to the feature space, which is typically measured at the output of the convolutional layers of the data encoder. Variants of this penalty have previously been used to encourage isometry between the latent space and data space (Yu et al., 2013; Benaim & Wolf, 2017; Sainburg et al., 2018). The structural consistency term further encourages the prior to match the encoded data manifold by preserving feature-isometry, which in turn is expected to assist with encoding the geometry of the data manifold, thus leading to improved latent representations. A key contribution of our work is the graph-based geodesic-interpolation algorithm. Conventionally, VAEs use Euclidean interpolation between two points in latent space.
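To make the SW distance concrete, the following is a minimal NumPy sketch of the standard Monte Carlo estimator: project both sample sets onto random directions ("slices") and average the closed-form 1D Wasserstein-2 distance between sorted projections. The function name and `n_projections` default are illustrative, not the paper's exact configuration, and the nonlinear-shearing variant is omitted.

```python
import numpy as np

def sliced_wasserstein(x, y, n_projections=50, rng=None):
    """Monte Carlo estimate of the squared sliced Wasserstein-2 distance
    between two equally sized sample sets x, y of shape (n, d)."""
    rng = np.random.default_rng(rng)
    d = x.shape[1]
    # Random unit vectors: the slicing directions.
    theta = rng.normal(size=(n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Project each sample set onto every direction: shape (n, n_projections).
    xp = np.sort(x @ theta.T, axis=0)
    yp = np.sort(y @ theta.T, axis=0)
    # In 1D, optimal transport matches sorted samples, so the
    # Wasserstein-2 cost is just the mean squared gap.
    return np.mean((xp - yp) ** 2)
```

Because each slice reduces to sorting, the estimator costs O(L n log n) for L projections, which is what makes the SW distance cheap compared to adversarial or full high-dimensional optimal-transport objectives.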
However, since manifolds typically have curvature, Euclidean distance is an unintuitive metric that can lead to unrealistic intermediate points. Our goal is to learn a true representation of the underlying data manifold, so it is natural to interpolate along manifold geodesics in latent space. Several works, such as Shao et al. (2018); Miolane & Holmes (2020b), endow the latent space with a Riemannian geometry and measure corresponding distances; however, these methods are difficult and involve explicitly solving expensive ordinary differential equations. In this work, we introduce 'network-geodesics', a graph-based method for interpolating along a manifold in latent space that maximizes sample density along paths while minimizing total energy. This involves first generating a distance graph between samples from the prior. This graph is then non-uniformly thresholded such that the set of allowable paths from a given sample traverses high-density regions through short hops. Lastly, we use a shortest-path algorithm, such as Dijkstra's algorithm (Dijkstra, 1959), to identify the lowest-'energy' path between two samples among the allowable paths. Since the prior is trained to learn the data manifold, the resulting network-geodesic curves give a notion of distance on the manifold and can be used to generate realistic interpolation points with relatively few prior samples. The novel contributions of this work are:
• We introduce a novel architecture, EPSWAE, consisting of a prior-encoder network that is efficiently trained (without expensive adversarial methods) to generate a prior encoding the geometric and topological properties of the data.
• We introduce a novel graph-based method for interpolating along network-geodesics in latent space by maximizing sample density while minimizing total energy. We show that it generates natural interpolations through realistic images.
• Improvements to the latent space representation are obtained by using a structural consistency term in the loss that encourages isometry between feature space and latent space and by using a simple and efficient nonlinear variant of the SW distance.
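The graph-based interpolation above can be sketched in a few lines. Since the exact non-uniform thresholding rule is not specified in this section, the sketch below substitutes a simple k-nearest-neighbor adjacency as a hypothetical stand-in for the density-aware pruning, then runs Dijkstra's algorithm over edge lengths; `network_geodesic` and its parameters are illustrative names, not the paper's implementation.

```python
import heapq
import numpy as np

def network_geodesic(samples, start, end, k=5):
    """Sketch: restrict edges to each node's k nearest neighbors
    (stand-in for non-uniform thresholding), then return the
    shortest path of sample indices from `start` to `end`."""
    n = len(samples)
    dist = np.linalg.norm(samples[:, None] - samples[None, :], axis=-1)
    # Adjacency: k nearest neighbors of each node, excluding itself.
    nbrs = np.argsort(dist, axis=1)[:, 1:k + 1]
    # Dijkstra's algorithm with a binary heap.
    best = np.full(n, np.inf)
    best[start] = 0.0
    prev = np.full(n, -1)
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > best[u]:
            continue  # stale heap entry
        if u == end:
            break
        for v in nbrs[u]:
            nd = d + dist[u, v]
            if nd < best[v]:
                best[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    # Walk predecessor links back from the endpoint.
    path = [end]
    while path[-1] != start:
        if prev[path[-1]] == -1:
            return None  # endpoints lie in disconnected components
        path.append(prev[path[-1]])
    return path[::-1]
```

Short hops through dense regions are favored automatically: a long edge that shortcuts across a gap in the manifold is simply absent from the pruned adjacency, so the shortest remaining path must follow the sampled manifold.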

2. BACKGROUND AND RELATED WORK

Several works have attempted to increase the complexity of the prior in order to obtain better latent representations. Most data can mathematically be thought of as lying on a low-dimensional manifold embedded in a high-dimensional space. In an image dataset, for instance, if images in high-dimensional pixel space are effectively parametrized using a small number of continuous variables, they will lie on or near a low-dimensional manifold (Lu et al., 1998). Many works such as Weinberger & Saul (2006); Rahimi et al.

