ON THE LATENT SPACE OF FLOW-BASED MODELS

Anonymous authors
Paper under double-blind review

Abstract

Flow-based generative models typically define a latent space whose dimensionality is identical to that of the observation space. In many problems, however, the data do not populate the full ambient space in which they natively reside, but rather inhabit a lower-dimensional manifold. In such scenarios, flow-based models cannot represent the data distribution exactly, since their density always has support off the data manifold, potentially degrading model performance. In addition, the requirement of equal latent and data space dimensionality can unnecessarily increase model complexity for contemporary flow models. To address these problems, we propose to learn a manifold prior that benefits both sample generation and representation quality. A by-product of our approach is that we are able to identify the intrinsic dimension of the data distribution.

1. INTRODUCTION

Normalizing flows (Rezende and Mohamed, 2015; Kobyzev et al., 2020) have shown considerable potential for modelling and inferring expressive distributions through the learning of well-specified probabilistic models. Contemporary flow-based approaches define a latent space with dimensionality identical to that of the data space, typically by parameterizing a complex model p_X(x | θ) using an invertible neural network f_θ. Samples drawn from an initial, simple distribution p_Z(z) (e.g. a Gaussian) can be mapped to a complex distribution as x = f_θ(z). The process results in a tractable density that inhabits the full data space. However, contemporary flow models may be an inappropriate choice for data that reside on a lower-dimensional manifold and thus do not populate the full ambient space. In such cases, the estimated model will necessarily place mass off the data manifold, which may result in under-fitting and poor generation quality. Furthermore, principal objectives such as Maximum Likelihood Estimation (MLE) and Kullback-Leibler (KL) divergence minimization are ill-defined, bringing additional challenges for model training. In this work, we propose a principled strategy to model a data distribution that lies on a continuous manifold, and we additionally identify the intrinsic dimension of the data manifold. Specifically, by using the connection between MLE and KL divergence minimization in Z space, we can address the important problem of the ill-defined KL divergence under typical flow-based assumptions.

Flow models are based on the idea of "change of variable". Assume a random variable Z with distribution P_Z and probability density p_Z(z). We can transform Z to obtain a random variable X = f(Z), where f : R^D → R^D is an invertible function with inverse g = f^{-1}.
Suppose X has distribution P_X and density p_X(x); then log p_X(x) takes the form

    log p_X(x) = log p_Z(g(x)) + log |det(∂g/∂x)|,                          (1)

where log |det(∂g/∂x)| is the log determinant of the Jacobian matrix of g. We call f (or g) a volume-preserving function if this log determinant is equal to 0.

Training of flow models typically makes use of MLE. We denote by X_d the random variable of the data, with distribution P_d and density p_d(x). In addition to the well-known connection between MLE and minimization of the KL divergence KL(p_d(x) || p_X(x)) in X space (see Appendix A for detail), MLE is also (approximately) equivalent to minimizing the KL divergence in Z space; this is because the KL divergence is invariant under invertible transformations (Yeung, 2008; Papamakarios et al., 2021). Let Q_Z denote the distribution of g(X_d), with density q(z). Then

    KL(q(z) || p(z)) = -∫ p_d(x) [ log p_Z(g(x)) + log |det(∂g/∂x)| ] dx + const.    (2)

The full derivation can be found in Appendix A. Since we can only access samples x_1, x_2, ..., x_N from p_d(x), we approximate the integral by Monte Carlo sampling:

    KL(q(z) || p(z)) ≈ -(1/N) Σ_{n=1}^{N} log p_X(x_n) + const.             (3)

We thus highlight the connection between MLE and KL divergence minimization, in Z space, for flow-based models. The prior distribution p(z) is usually chosen to be a D-dimensional Gaussian distribution. However, if the data distribution P_d is singular, for example a measure on a low-dimensional manifold, the induced latent distribution Q_Z is also singular. In this case, the KL divergence in equation 2 is typically not well-defined under the considered flow-based model assumptions. This issue brings both theoretical and practical challenges that we discuss in the following section.
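As a concrete numerical illustration (our own sketch, not the paper's code), equations 1 and 3 can be checked with a hypothetical one-dimensional "flow" f(z) = exp(z), for which g(x) = log x and log |det(∂g/∂x)| = -log x. With a standard Gaussian prior, the induced p_X is the standard log-normal density, and the Monte Carlo negative log-likelihood over samples from a matching data distribution converges to the entropy of p_d (i.e. the KL term vanishes up to the constant):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p_Z(z):
    """Standard Gaussian prior log density p_Z(z)."""
    return -0.5 * z**2 - 0.5 * np.log(2.0 * np.pi)

def log_p_X(x):
    """Change of variables: log p_X(x) = log p_Z(g(x)) + log|det dg/dx|,
    with g(x) = log x, so the log-det term is -log x."""
    return log_p_Z(np.log(x)) - np.log(x)

# Sanity check against the known standard log-normal log density.
x0 = 1.7
expected = -0.5 * np.log(x0) ** 2 - 0.5 * np.log(2.0 * np.pi) - np.log(x0)
assert np.isclose(log_p_X(x0), expected)

# Monte Carlo estimate (equation 3): -(1/N) sum_n log p_X(x_n) + const.
# Here the data p_d is itself log-normal, so the model is well specified
# and the estimate approaches the entropy of p_d.
N = 100_000
x_samples = np.exp(rng.standard_normal(N))        # x_n ~ p_d (log-normal)
nll = -np.mean(log_p_X(x_samples))
entropy_p_d = 0.5 * (1.0 + np.log(2.0 * np.pi))   # entropy of LogNormal(0, 1)
assert abs(nll - entropy_p_d) < 0.05
```

Here the data distribution has full support in the ambient (1-D) space, so the objective is well-behaved; the singular case discussed next is precisely where this estimator breaks down.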

2. FLOW MODELS WITH MANIFOLD DATA

We assume a data sample x ∼ P_d to be a D-dimensional vector x ∈ R^D, and define the ambient dimensionality of P_d, denoted Amdim(P_d), to be D. However, for many datasets of interest, e.g. natural images, the data distribution P_d is commonly believed to be supported on a lower-dimensional manifold (Beymer and Poggio, 1996). We assume the dimensionality of this manifold to be K, where K < D, and define the intrinsic dimension of P_d, denoted Indim(P_d), to be the dimension of this manifold. Figure 1a provides an example of this setting, where P_d is a 1D distribution in 2D space. Specifically, each data sample x ∼ P_d is a 2D vector x = (x_1, x_2), where x_1 ∼ N(0, 1) and x_2 = sin(2 x_1). This example therefore has Amdim(P_d) = 2 and Indim(P_d) = 1.

In flow-based models, the function f is constructed to be both bijective and differentiable. When the prior P_Z is a distribution whose support is R^D (e.g. a multivariate Gaussian), the marginal distribution P_X will also have support R^D, and Amdim(P_X) = Indim(P_X) = D. When the support of the data distribution lies on a K-dimensional manifold with K < D, P_d and P_X are constrained to have different supports. That is, the intrinsic dimensions of P_X and P_d are always different: Indim(P_X) ≠ Indim(P_d). In this case it is impossible to learn a model distribution P_X identical to the data distribution P_d. Nevertheless, flow-based models have shown strong empirical success in real-world problem domains, such as the ability to generate high-quality, realistic images (Kingma and Dhariwal, 2018). Towards investigating the cause of this disparity between theory and practice, we employ a toy example to provide intuition for the effects and consequences of model and data distributions that possess differing intrinsic dimension. Consider the toy dataset introduced previously: a 1D distribution lying in a 2D space (Figure 1a).
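The toy distribution above can be sampled in a few lines; a minimal sketch (the sample size and seed are our own choices) that also makes the singularity explicit, since one coordinate deterministically fixes the other:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_toy(n):
    """Toy data of Section 2: x = (x1, x2) with x1 ~ N(0, 1), x2 = sin(2*x1).
    Amdim(P_d) = 2 (ambient space), Indim(P_d) = 1 (a 1-D curve in R^2)."""
    x1 = rng.standard_normal(n)
    x2 = np.sin(2.0 * x1)
    return np.stack([x1, x2], axis=1)

X = sample_toy(5000)
assert X.shape == (5000, 2)                          # ambient dimension D = 2

# Every sample satisfies the manifold constraint x2 = sin(2*x1) exactly,
# so P_d places no mass anywhere else in R^2: the distribution is singular.
assert np.allclose(X[:, 1], np.sin(2.0 * X[:, 0]))
```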
The prior density p(z) is a standard 2D Gaussian, p(z) = N(0, I), and the function f is a non-volume-preserving flow with two coupling layers (see Appendix C.1). In Figure 1b we plot samples from the flow model; a sample x is generated by first drawing a 2D point z ∼ N(0, I) and then setting x = f(z). Figure 1c shows samples from the distributions P_Z and Q_Z. Q_Z is defined as the transformation of P_d under the bijective function g, so Q_Z is supported on a 1D manifold in 2D space and Indim(Q_Z) = Indim(P_d) = 1. Training Q_Z to match P_Z (which has intrinsic dimension 2) can be seen in Figure 1c to curl up the manifold in the latent space, contorting it towards a distribution of intrinsic dimension 2. This ill-behaved phenomenon causes several potential problems for contemporary flow models:

1. Poor sample quality. Figure 1b shows examples where the incorrect dimensionality assumption results in the model generating bad samples.
2. Low-quality data representations. The "curling up" of the latent space may degrade the quality of the learned representations.
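A single affine (non-volume-preserving) coupling layer of the kind used in the toy flow can be sketched as follows. This is an illustrative stand-in, not the architecture of Appendix C.1: the scale and translation functions here are simple parametric maps (`w_s`, `w_t` are hypothetical scalar parameters) rather than learned networks, but the forward/inverse structure and the log-det-Jacobian are the generic coupling-layer ones:

```python
import numpy as np

def coupling_forward(z, w_s, w_t):
    """Affine coupling in 2-D: x1 = z1, x2 = z2 * exp(s(z1)) + t(z1).
    Returns x and log|det df/dz| = s(z1) (non-zero, so non-volume-preserving)."""
    z1, z2 = z[..., 0], z[..., 1]
    s = np.tanh(w_s * z1)          # stand-in scale function s(z1)
    t = w_t * z1                   # stand-in translation function t(z1)
    x = np.stack([z1, z2 * np.exp(s) + t], axis=-1)
    return x, s

def coupling_inverse(x, w_s, w_t):
    """Exact inverse g: z1 = x1, z2 = (x2 - t(x1)) * exp(-s(x1))."""
    x1, x2 = x[..., 0], x[..., 1]
    s = np.tanh(w_s * x1)
    t = w_t * x1
    return np.stack([x1, (x2 - t) * np.exp(-s)], axis=-1)

# Invertibility holds by construction, whatever the parameters.
z = np.array([[0.3, -1.2], [1.0, 0.5]])
x, log_det = coupling_forward(z, w_s=0.7, w_t=-0.4)
assert np.allclose(coupling_inverse(x, 0.7, -0.4), z)
```

Note that because the layer is bijective on R^2, samples of Q_Z = g(P_d) necessarily remain on a 1-D curve, which is exactly the mismatch with the 2-D Gaussian prior discussed above.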

