ON THE LATENT SPACE OF FLOW-BASED MODELS

Anonymous authors
Paper under double-blind review

Abstract

Flow-based generative models typically define a latent space with dimensionality identical to the observational space. In many problems, however, the data does not populate the full ambient space it natively resides in, but rather inhabits a lower-dimensional manifold. In such scenarios, flow-based models are unable to represent the data structure exactly, as their density will always have support off the data manifold, potentially resulting in degraded model performance. In addition, the requirement for equal latent and data space dimensionality can unnecessarily increase the complexity of contemporary flow models. Towards addressing these problems, we propose to learn a manifold prior that benefits both sample generation and representation quality. An auxiliary product of our approach is that we are able to identify the intrinsic dimension of the data distribution.

1. INTRODUCTION

Normalizing flows (Rezende and Mohamed, 2015; Kobyzev et al., 2020) have shown considerable potential for modelling and inferring expressive distributions through the learning of well-specified probabilistic models. Contemporary flow-based approaches define a latent space with dimensionality identical to the data space, typically by parameterizing a complex model p_X(x|θ) using an invertible neural network f_θ. Samples drawn from an initial, simple distribution p_Z(z) (e.g. a Gaussian) can be mapped to a complex distribution as x = f_θ(z). The process results in a tractable density that inhabits the full data space. However, contemporary flow models may be an inappropriate choice for representing data that resides on a lower-dimensional manifold and thus does not populate the full ambient space. In such cases, the estimated model will necessarily have mass lying off the data manifold, which may result in under-fitting and poor generation quality. Furthermore, principal objectives such as Maximum Likelihood Estimation (MLE) and Kullback-Leibler (KL) divergence minimization are ill-defined, bringing additional challenges for model training. In this work, we propose a principled strategy to model a data distribution that lies on a continuous manifold, and we additionally identify the intrinsic dimension of the data manifold. Specifically, by using the connection between MLE and KL divergence minimization in Z space, we can address the important problem of the ill-defined KL divergence under typical flow-based assumptions.

Flow models are based on the idea of "change of variables". Assume a random variable Z with distribution P_Z and probability density p_Z(z). We can transform Z to obtain a random variable X: X = f(Z), where f: R^D → R^D is an invertible function with inverse f^{-1} = g.
Suppose X has distribution P_X and density function p_X(x); then log p_X(x) has the form

log p_X(x) = log p_Z(g(x)) + log |det ∂g/∂x|,

where log |det ∂g/∂x| is the log determinant of the Jacobian matrix. We call f (or g) a volume-preserving function if the log determinant is equal to 0. Training of flow models typically makes use of MLE. We denote by X_d the random variable of the data, with distribution P_d and density p_d(x). In addition to the well-known connection between MLE and minimization of the KL divergence KL(p_d(x)||p_X(x)) in X space (see Appendix A for details), MLE is also (approximately) equivalent to minimizing the KL divergence in Z space, because the KL divergence is invariant under invertible transformations (Yeung, 2008; Papamakarios et al., 2019). Specifically, we define Z_Q = g(X_d) with distribution Q_Z and density function q(z); the KL divergence in Z space can be written as

KL(q(z)||p(z)) = ∫ q(z) log q(z) dz − ∫ q(z) log p(z) dz   (2)
             = −∫ p_d(x) [log p_Z(g(x)) + log |det ∂g/∂x|] dx + const.

The full derivation can be found in Appendix A. Since we can only access samples x_1, x_2, ..., x_N from p_d(x), we approximate the integral by Monte Carlo sampling:

KL(q(z)||p(z)) ≈ −(1/N) Σ_{n=1}^N log p_X(x_n) + const.

We thus highlight the connection, in Z space, between MLE and KL divergence minimization for flow-based models. The prior distribution p(z) is usually chosen to be a D-dimensional Gaussian. However, if the data distribution P_d is singular, for example a measure on a low-dimensional manifold, the induced latent distribution Q_Z is also singular. In this case, the KL divergence in equation 2 is typically not well-defined under the considered flow-based model assumptions. This issue brings both theoretical and practical challenges that we discuss in the following section.
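As a concrete illustration of the change of variables formula and the Monte Carlo MLE objective above, the following NumPy sketch uses a simple linear flow f(z) = Wz + b. The linear flow is an illustrative assumption only; real flow models parameterize f with invertible neural networks such as coupling layers.

```python
import numpy as np

# Minimal sketch of the change of variables formula for a linear flow
# x = f(z) = W z + b with invertible W (an illustrative assumption;
# actual flow models use invertible neural networks).
rng = np.random.default_rng(0)
D = 2
W = np.array([[2.0, 0.3], [0.0, 0.5]])  # invertible: nonzero diagonal
b = np.array([1.0, -1.0])

def g(x):
    """Inverse map g = f^{-1}: x -> z."""
    return np.linalg.solve(W, x - b)

def log_p_Z(z):
    """Standard Gaussian prior log density in D dimensions."""
    return -0.5 * (z @ z) - 0.5 * D * np.log(2 * np.pi)

def log_p_X(x):
    """log p_X(x) = log p_Z(g(x)) + log |det dg/dx|."""
    # For g(x) = W^{-1}(x - b) the Jacobian is W^{-1},
    # so log|det dg/dx| = -log|det W|.
    log_det_g = -np.log(abs(np.linalg.det(W)))
    return log_p_Z(g(x)) + log_det_g

# MLE objective: Monte Carlo estimate of the negative log likelihood,
# equal (up to a constant) to the KL divergence in Z space.
xs = (rng.standard_normal((1000, D)) @ W.T) + b  # "data" samples
nll = -np.mean([log_p_X(x) for x in xs])
```

Since the "data" here is generated by the same flow, the negative log likelihood converges to the differential entropy of the data distribution as the sample size grows.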

2. FLOW MODELS WITH MANIFOLD DATA

We assume a data sample x ∼ P_d to be a D-dimensional vector x ∈ R^D and define the ambient dimensionality of P_d, denoted Amdim(P_d), to be D. However, for many datasets of interest, e.g. natural images, the data distribution P_d is commonly believed to be supported on a lower-dimensional manifold (Beymer and Poggio, 1996). We assume the dimensionality of this manifold to be K, where K < D, and define the intrinsic dimension of P_d, denoted Indim(P_d), to be the dimension of this manifold. Figure 1a provides an example of this setting, where P_d is a 1D distribution in 2D space. Specifically, each data sample x ∼ P_d is a 2D vector x = (x_1, x_2), where x_1 ∼ N(0, 1) and x_2 = sin(2x_1). This example therefore has Amdim(P_d) = 2 and Indim(P_d) = 1. In flow-based models, the function f is constructed to be both bijective and differentiable. When the prior P_Z is a distribution whose support is R^D (e.g. a multivariate Gaussian), the marginal distribution P_X will also have support R^D, and Amdim(P_X) = Indim(P_X) = D. When the support of the data distribution lies on a K-dimensional manifold with K < D, P_d and P_X are constrained to have different supports. That is, the intrinsic dimensions of P_X and P_d are always different: Indim(P_X) ≠ Indim(P_d). In this case it is impossible to learn a model distribution P_X identical to the data distribution P_d. Nevertheless, flow-based models have shown strong empirical success in real-world problem domains, such as the ability to generate high-quality and realistic images (Kingma and Dhariwal, 2018). Towards investigating the cause of this disparity between theory and practice, we employ a toy example to provide intuition for the consequences of model and data distributions possessing differing intrinsic dimensions. Consider the toy dataset introduced previously: a 1D distribution lying in a 2D space (Figure 1a).
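The toy dataset above can be generated in a few lines; the sketch below follows the stated construction (x_1 ∼ N(0, 1), x_2 = sin(2x_1)), so every sample lies exactly on a 1D curve embedded in 2D space.

```python
import numpy as np

# Sketch of the toy dataset from Section 2: a 1D manifold embedded in 2D.
# Each sample is x = (x1, x2) with x1 ~ N(0, 1) and x2 = sin(2 * x1),
# so Amdim(P_d) = 2 but Indim(P_d) = 1.
rng = np.random.default_rng(0)

def sample_toy(n):
    x1 = rng.standard_normal(n)
    x2 = np.sin(2.0 * x1)
    return np.stack([x1, x2], axis=1)

X = sample_toy(5000)  # shape (5000, 2); every point lies on the curve
```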
The prior density p(z) is a standard 2D Gaussian p(z) = N(0, I) and the function f is a non-volume-preserving flow with two coupling layers (see Appendix C.1). In Figure 1b we plot samples from the flow model; a sample x is generated by first sampling a 2D point z ∼ N(0, I) and then letting x = f(z). Figure 1c shows samples from the distributions P_Z and Q_Z. Q_Z is defined as the transformation of P_d through the bijective function g, so Q_Z is constrained to be supported on a 1D manifold in 2D space, and Indim(Q_Z) = Indim(P_d) = 1. Training Q_Z to match P_Z (which has intrinsic dimension 2) can be seen in Figure 1c to result in a curling up of the manifold in the latent space, contorting it towards satisfying a distribution with intrinsic dimension 2. This ill-behaved phenomenon causes several potential problems for contemporary flow models:

1. Poor sample quality. Figure 1b shows examples where incorrect assumptions result in the model generating bad samples.
2. Low-quality data representations. The "curling up" of the latent space may degrade representation quality.
3. Inefficient use of network capacity. Network capacity is spent on contorting the distribution Q_Z to satisfy imposed dimensionality constraints.

A natural solution to the problem of intrinsic dimension mismatch is to select a prior distribution P_Z with the same dimensionality as the intrinsic dimension of the data distribution, such that Indim(P_Z) = Indim(P_d). However, since we do not know Indim(P_d) explicitly, one option is to instead learn it from the data distribution. In the following section, we introduce a parameterization that enables us to learn Indim(P_d).

3. LEARNING A MANIFOLD PRIOR

Figure 2: The data (black dots) lies on a 2D manifold in 3D space.
To obtain a model P_X with Indim(P_X) = 2, we learn a prior P_Z that is a 2D Gaussian in 3D space and an invertible function f which maps from P_Z to P_X. Consider a data vector x ∈ R^D; a flow-based model prior P_Z is usually given by a D-dimensional Gaussian distribution, or an alternative simple distribution that is also absolutely continuous (a.c.) in R^D. Therefore, the intrinsic dimension Indim(P_Z) = D. To allow a prior with intrinsic dimension strictly less than D, we let P_Z have the 'generalized density' p(z) = N(0, AA^T), where z ∈ R^D and A is a D×D lower-triangular matrix with D(D+1)/2 parameters, such that AA^T is constrained to be positive semi-definite. When AA^T has full rank D, P_Z is a (nondegenerate) multivariate Gaussian on R^D. When Rank(AA^T) = K with K < D, P_Z degenerates to a Gaussian supported on a K-dimensional manifold, so that the intrinsic dimension Indim(P_Z) = K. Figure 2 illustrates a sketch of this scenario. In practice, we initialize A to the identity matrix, so AA^T is also the identity and P_Z is initialized as a standard Gaussian. When Rank(AA^T) < D, the degenerate covariance AA^T is no longer invertible and we are unable to evaluate the density p(z) for a given random vector z. Furthermore, when the data distribution P_d is supported on a K-dimensional manifold, Q_Z will also be supported on a K-dimensional manifold and no longer has a valid density function. Using equation 2 to train the flow model then becomes impossible, as the KL divergence between P_Z and Q_Z is not well-defined. Recent work by Zhang et al. (2020) proposed a new family of divergences to address this problem. In the following section, we briefly review the key concepts pertaining to this family of divergences.
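A small numerical sketch of the manifold prior: sampling from N(0, AA^T) reduces to z = Aε with ε ∼ N(0, I), which works even when AA^T is rank deficient. The specific collapsed A below is an illustrative assumption standing in for the outcome of training.

```python
import numpy as np

# Sketch of the manifold prior p(z) = N(0, A A^T) with a lower-triangular
# A initialized to the identity. If directions of A shrink to zero during
# training, A A^T becomes rank deficient and the prior degenerates onto a
# lower-dimensional linear subspace.
D = 3
A = np.eye(D)
A[2, :] = 0.0  # illustrative assumption: third direction has collapsed

cov = A @ A.T
rank = np.linalg.matrix_rank(cov)  # Indim(P_Z) = Rank(A A^T) = 2

# Sampling from the (possibly degenerate) prior: z = A eps, eps ~ N(0, I).
rng = np.random.default_rng(0)
eps = rng.standard_normal((1000, D))
z = eps @ A.T  # all samples lie on a 2D plane in 3D space (z[:, 2] == 0)
```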

4. FIXING THE ILL-DEFINED KL DIVERGENCE

Let Z_Q and Z_P be two random variables with distributions Q_Z and P_Z. The KL divergence between Q_Z and P_Z is not defined if Q_Z or P_Z does not have a valid density function. Let K be an a.c. random variable, independent of Z_Q and Z_P, with density p_K. We define Z̃_P = Z_P + K and Z̃_Q = Z_Q + K, with distributions P̃_Z and Q̃_Z respectively. Then P̃_Z and Q̃_Z are a.c. with density functions

q̃(z̃) = ∫ p_K(z̃ − z) dQ_Z(z),   p̃(z̃) = ∫ p_K(z̃ − z) dP_Z(z).   (6)

We can thus define the spread KL divergence between Q_Z and P_Z as the KL divergence between Q̃_Z and P̃_Z:

K̃L(Q_Z||P_Z) ≡ KL(Q̃_Z||P̃_Z) ≡ KL(q̃(z̃)||p̃(z̃)).   (7)

In this work we let K be a Gaussian with diagonal covariance σ_Z² I, which satisfies the sufficient conditions for K̃L to be a valid divergence (see Zhang et al. (2020) for details), with the properties

K̃L(Q_Z||P_Z) ≥ 0,   K̃L(Q_Z||P_Z) = 0 ⇔ Q_Z = P_Z.

Since Q_Z and P_Z are transformed from P_d and P_X using the invertible function g, we have Q_Z = P_Z ⇔ P_d = P_X. Therefore, the spread KL divergence can be used to train flow-based models with a manifold prior in order to fit a dataset that lies on a lower-dimensional manifold.
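A minimal numerical sketch of the spread construction: a rank-deficient Gaussian N(0, AA^T) has no density, but convolving with K ∼ N(0, σ_Z² I) yields the full-rank Gaussian N(0, AA^T + σ_Z² I), whose density can always be evaluated. The specific A and σ_Z² below are illustrative assumptions.

```python
import numpy as np

# Convolving a singular distribution with Gaussian noise K ~ N(0, s2 I)
# yields an absolutely continuous one. Here P_Z = N(0, A A^T) is
# degenerate (rank 2 in 3D); its spread version has the closed-form,
# nondegenerate density N(0, A A^T + s2 I).
D, s2 = 3, 1e-4
A = np.diag([1.0, 1.0, 0.0])          # rank-2: N(0, A A^T) has no density
cov_spread = A @ A.T + s2 * np.eye(D)  # full rank: density well defined

def log_density(z, cov):
    """Log density of a zero-mean Gaussian with covariance cov."""
    d = len(z)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (z @ np.linalg.solve(cov, z) + logdet + d * np.log(2 * np.pi))

z = np.array([0.5, -0.2, 0.0])
lp = log_density(z, cov_spread)  # finite, despite N(0, A A^T) being singular
```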

5. IDENTIFIABILITY OF INTRINSIC DIMENSION

A byproduct of our model is that the intrinsic dimension of the data manifold can be identified. Section 4 shows that the spread KL divergence satisfies KL(Q_Z||P_Z) = 0 ⇔ P_X = P_d, in which case the supports of P_X and P_d also have the same intrinsic dimension: Indim(P_X) = Indim(P_d). The flow function f and its inverse g = f^{-1} are bijective and continuous, so f is a diffeomorphism (Kobyzev et al., 2020). By the invariance-of-dimension property of diffeomorphisms (Lee, 2013, Theorem 2.17), the manifold that supports P_Z has the same dimension as the manifold that supports P_X. Therefore, Indim(P_Z) = Indim(P_X) = Indim(P_d). Since the intrinsic dimension of P_Z is equal to the rank of AA^T, we can compute Rank(AA^T) by counting the number of non-zero eigenvalues of AA^T. This allows identification of the intrinsic dimension of the data distribution as Indim(P_d) = Rank(AA^T). We have shown that we can identify the intrinsic dimension of P_d using the spread KL divergence. In the next section, we discuss how to estimate the spread KL divergence in practice.
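The eigenvalue-counting step can be sketched as follows. In practice eigenvalues converge toward zero rather than reaching it exactly, so a small threshold is needed; the specific tolerance is an assumption of this sketch.

```python
import numpy as np

# Sketch: after training, read off Indim(P_d) = Rank(A A^T) by counting
# eigenvalues of A A^T above a small threshold (exact numerical zeros are
# not expected; the tolerance 1e-6 is an assumption).
def intrinsic_dimension(A, tol=1e-6):
    eigvals = np.linalg.eigvalsh(A @ A.T)  # real eigenvalues, ascending
    return int(np.sum(eigvals > tol))

A = np.array([[1.0, 0.0, 0.0],
              [0.3, 0.8, 0.0],
              [0.1, 0.2, 0.0]])  # last column zero -> rank 2
dim = intrinsic_dimension(A)
```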

6. ESTIMATION OF THE SPREAD KL DIVERGENCE

Our goal is to minimize the spread KL divergence between Q_Z and P_Z. Using the definition of the spread divergence (equation 7), this is equivalent to minimizing

KL(q̃(z̃)||p̃(z̃)) = ∫ q̃(z̃) log q̃(z̃) dz̃ [Term 1] − ∫ q̃(z̃) log p̃(z̃) dz̃ [Term 2],   (12)

where q̃(z̃) and p̃(z̃) are defined in equation 6. We separate the objective into two terms and discuss the estimation of each.

Term 1: We use H(•) to denote differential entropy. Term 1 is the negative entropy −H(Z̃_Q). For a volume-preserving g and a.c. X_d, the entropy H(Z_Q) = H(X_d) is independent of the model parameters and can be ignored during training. However, the entropy H(Z̃_Q) = H(Z_Q + K) still depends on g; see Appendix B.1 for an example. We claim that when the variance of K is small, the dependency between H(Z̃_Q) and the volume-preserving function g is weak, so we can approximate equation 12 by leaving out Term 1 without affecting training. To build intuition, we first assume X_d is a.c., so Z_Q = g(X_d) is also a.c. Using standard entropic properties (Kontoyiannis and Madiman, 2014), we have the relationship

H(Z_Q) ≤ H(Z_Q + K) = H(Z_Q) + I(Z_Q + K, K),   (13)

where I(•, •) denotes the mutual information. Since H(Z_Q) is independent of the volume-preserving function g, and since I(Z_Q + K, K) → 0 as σ_Z² → 0 (see Appendix B.2 for a proof), the contribution of the I(Z_Q + K, K) term to the training of g becomes negligible for small σ_Z². Unfortunately, equation 13 is no longer valid when P_d lies on a manifold, since Z_Q is then a singular random variable and the differential entropy H(Z_Q) is not defined. In Appendix B.3, we show that leaving out the entropy H(Z̃_Q) corresponds to minimizing an upper bound of the spread KL divergence. To further probe the contribution of the negative entropy term, we compare leaving out −H(Z̃_Q) against approximating −H(Z̃_Q) during training.
In Appendix B.4, we describe the approximation technique and give empirical evidence showing that ignoring −H(Z̃_Q) does not affect the training of our model. We therefore use a volume-preserving g and a small variance σ_Z² = 1×10⁻⁴ in our experiments. In contrast to other volume-preserving flows, which use a fixed prior P_Z, our method affords additional flexibility by allowing the 'volume' of the prior to change towards matching the target data distribution. In this way, our decision to employ volume-preserving flow functions does not, in principle, limit the expressive power of the model. Popular non-volume-preserving flow structures, e.g. the affine coupling flow, may also easily be normalized to become volume-preserving, further extending the applicability of our approach (see Appendix C.1 for an example).

Term 2: The noisy prior p̃(z̃) is the degenerate Gaussian N(0, AA^T) convolved with Gaussian noise N(0, σ_Z² I), and has the closed-form density p̃(z̃) = N(0, AA^T + σ_Z² I). Therefore, log p̃(z̃) is well-defined and we can approximate Term 2 by Monte Carlo:

∫ q̃(z̃) log p̃(z̃) dz̃ ≈ (1/N) Σ_{n=1}^N log p̃(z̃_n),

where q̃(z̃) = ∫ p(z̃|z) dQ_Z(z). To sample from q̃(z̃), we first draw a data sample x ∼ P_d, use the function g to obtain z = g(x) (so z is a sample from Q_Z), and finally sample z̃ ∼ p(z̃|z).
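The Term 2 estimate above can be sketched numerically. The identity flow standing in for g and the Gaussian stand-in for P_d are illustrative assumptions; in the actual model g is a trained volume-preserving network.

```python
import numpy as np

# Monte Carlo estimate of Term 2: sample x ~ P_d, map to z = g(x), add
# noise to get z~ ~ N(z, s2 I), then average log p~(z~) under the noisy
# prior N(0, A A^T + s2 I). g is a placeholder identity flow here.
rng = np.random.default_rng(0)
D, s2, N = 2, 1e-4, 1000
A = np.eye(D)
cov = A @ A.T + s2 * np.eye(D)

def g(x):  # placeholder volume-preserving flow (assumption)
    return x

def log_p_tilde(z):
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (z @ np.linalg.solve(cov, z) + logdet + D * np.log(2 * np.pi))

x = rng.standard_normal((N, D))                        # samples from P_d
z = np.array([g(xi) for xi in x])                      # z = g(x) ~ Q_Z
z_noisy = z + np.sqrt(s2) * rng.standard_normal((N, D))  # z~ ~ p(z~|z)
term2 = np.mean([log_p_tilde(zi) for zi in z_noisy])   # MC estimate
```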

7. EXPERIMENTS

To demonstrate the effectiveness of our approach, Sections 7.1-7.4 report experiments on four datasets: toy 2D and 3D data, the fading square dataset, and a synthesized adaptation of MNIST. We use the Adam optimizer (Kingma and Ba, 2014). We compare a volume-preserving flow with a learnable prior (our method) to a non-volume-preserving flow with a fixed prior, so that both models have the ability to adapt their 'volume' to fit the target distribution, retaining a fair comparison. See Appendix C.1 for a detailed discussion of the incompressible affine coupling layer and the network structures we use for all experiments.

7.1. 2D TOY DATA

We first verify our method using the toy dataset described in Section 2 and Figure 1a. The flow function has two coupling layers. We train our model with learning rate 3×10⁻⁴ and batch size 100 for 10k iterations. Figure 3 shows samples from the model, the learned prior, and the eigenvalues of AA^T. We observe that the sample quality is better than in Figure 1b and that the prior has learned a degenerate Gaussian with Indim(P_Z) = 1, matching Indim(P_d). In Appendix C.2, we show that our model not only learns the manifold support of the target distribution but also captures the 'density' allocation on the manifold.

7.2. S-CURVE DATASET

We fit our model to the S-curve dataset shown in Figure 4a. The data distribution lies on a 2D manifold in 3D space, so Indim(P_d) = 2. The specific network structure and training details can be found in Appendix C.1. After training, our model learns a nonlinear function g that transforms P_d to Q_Z, where the latter lies on a 2D linear subspace in 3D space (see Figure 4b). Following this, a linear dimensionality reduction can be conducted to generate 2D data representations; we now briefly outline a general procedure. For Q_Z with Amdim(Q_Z) = D and Indim(Q_Z) = K, we first find the eigenvectors e_1, ..., e_D of AA^T, sorted by their eigenvalues. When Rank(AA^T) = K ≤ D, there exist K eigenvectors with positive eigenvalues. We select the first K eigenvectors and form the matrix E = [e_1, ..., e_K] of dimension D×K. We then transform each data sample x ∈ R^D into Z space, z = g(x) with z ∈ R^D, and carry out the linear projection z_proj = zE to obtain the lower-dimensional representation z_proj ∈ R^K. This procedure can be seen as a nonlinear dimensionality reduction, where the nonlinear component is given by the learned function g. We plot the resulting representations in Figure 4b. The colormap indicates the correspondence between the data in 3D space and the representations in 2D space. We observe that our method successfully (1) identifies that the intrinsic dimension of the data is two and (2) projects the data into a 2D space that faithfully preserves the structure of the original data distribution. We also compare sample generation quality with a flow that has a fixed Gaussian prior; see Appendix C.3 for details.
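The linear reduction step of the procedure above can be sketched as follows. The nonlinear part (the learned g) is omitted; we operate directly on latent codes z, and the specific A and codes below are illustrative assumptions.

```python
import numpy as np

# Project latent codes z = g(x) onto the K eigenvectors of A A^T with
# positive eigenvalues: z_proj = z E, where E = [e_1, ..., e_K].
def project_latents(z, A, tol=1e-6):
    """z: (N, D) latent codes. Returns (N, K) representations."""
    eigvals, eigvecs = np.linalg.eigh(A @ A.T)  # ascending order
    order = np.argsort(eigvals)[::-1]           # sort high -> low
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    K = int(np.sum(eigvals > tol))              # number of positive eigvals
    E = eigvecs[:, :K]                          # (D, K)
    return z @ E                                # z_proj = z E

rng = np.random.default_rng(0)
A = np.diag([2.0, 1.0, 0.0])             # rank 2, so K = 2
z = rng.standard_normal((100, 3)) @ A.T  # codes lying on a 2D subspace
z_proj = project_latents(z, A)           # shape (100, 2)
```

Since the codes lie in the span of the retained eigenvectors, the projection preserves their norms (no information on the manifold is lost).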

7.3. FADING SQUARE DATASET

The fading square dataset (Rubenstein et al., 2018) was proposed to diagnose model behavior when the data distribution and model possess differing intrinsic dimensions, and therefore constitutes a further relevant test bed for our work. The dataset consists of 32×32 pixel images with 6×6 grey squares on a black background. The grey-scale values are sampled from a uniform distribution on [0, 1], so Indim(P_d) = 1. Figure 5a shows data samples. We fit our model to the dataset; the network structure and training details can be found in Appendix C. Figure 5b shows samples from our trained model. Figure 5d shows the first 20 eigenvalues of AA^T (ranked from high to low); only one eigenvalue is larger than zero and the others have converged to zero. This illustrates that we have successfully identified the intrinsic dimension of P_d. We further carry out the dimensionality reduction process introduced in Section 7.2: the latent representation z is projected onto a 1D line, and we visualize the correspondence between the projected representation and the data in Figure 5e. Pixel grey-scale values can be observed to decay as the 1D representation is traversed from left to right, indicating that our representations are consistent with the properties of the original data distribution. In contrast, we find that the traditional flow model, with a fixed 1024D Gaussian p(z), fails to learn the data distribution; see Figure 5c.
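A sketch of the data-generating process for this dataset follows. The square placement (centered) is an assumption of this sketch; the dataset as described only fixes the image and square sizes and the uniform grey level.

```python
import numpy as np

# Sketch of the fading square data (Rubenstein et al., 2018): 32x32 black
# images with a 6x6 grey square whose intensity is the single latent
# factor, so Indim(P_d) = 1 while Amdim(P_d) = 1024.
rng = np.random.default_rng(0)

def sample_fading_squares(n, size=32, square=6):
    imgs = np.zeros((n, size, size))
    top = (size - square) // 2            # centered square (assumption)
    grey = rng.uniform(0.0, 1.0, size=n)  # the 1D latent factor
    for i in range(n):
        imgs[i, top:top + square, top:top + square] = grey[i]
    return imgs

X = sample_fading_squares(100)  # shape (100, 32, 32); flattened dim 1024
```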

7.4. MNIST DATA

We further investigate training our model on images of digits. However, for datasets like MNIST (LeCun, 1998), the true intrinsic dimension is unknown. To verify our model's ability to identify the intrinsic data dimension, we construct synthetic datasets by first fitting an implicit model p_θ(x) = ∫ δ(x − g(z)) p(z) dz to the MNIST dataset, and then using samples from the trained model x ∼ p_θ(x) as training data. The intrinsic dimension of the training dataset equals the dimension of the latent variable z in the implicit model: Indim(P_d) = dim(z). We construct two datasets with dim(z) = 5 and dim(z) = 10, so Indim(P_d) = 5 and Indim(P_d) = 10, respectively. Further details on the implicit model, flow network structure, training method, and samples from the learned models are found in Appendix C. In contrast to the fading square dataset, we find that in order to train the model successfully (i.e. such that valid image samples are generated), it is necessary to add small Gaussian noise to the training data. This trick is commonly used in training flow-based models on image data (Sorrenson et al., 2020). We note that adding Gaussian noise breaks the assumption that the data lies on a manifold; instead, the intrinsic dimension of the training data distribution becomes equal to its ambient dimension. To alleviate this undesired effect, we first add Gaussian noise with standard deviation σ_x = 0.05 and anneal σ_x after 2000k iterations by a factor of 0.9 every 10k iterations. However, to retain successful model training, we prevent the annealing procedure from reaching zero added noise, owing to the model behavior outlined above for this image dataset. Experimentally, we cap the Gaussian noise level at a lower bound of 0.01 and leave further investigation of this phenomenon to future work.
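The annealing schedule described above can be sketched as follows; the function name and step-based interface are ours, not part of the paper's implementation.

```python
# Sketch of the noise annealing schedule: start at sigma_x = 0.05, then
# after 2000k iterations decay by a factor of 0.9 every 10k iterations,
# floored at 0.01 so the added noise never reaches zero.
def noise_std(step, start=0.05, floor=0.01,
              warmup=2_000_000, every=10_000, rate=0.9):
    if step <= warmup:
        return start
    decays = (step - warmup) // every
    return max(floor, start * rate ** decays)
```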
In Figures 6a and 6b, we plot the first 20 eigenvalues of AA^T (ranked from high to low) after training on the two synthetic MNIST datasets with intrinsic dimensions 5 and 10. It can be observed that 5 and 10 eigenvalues, respectively, are significantly larger than the remaining values. It can thus be concluded that the intrinsic dimensions of the two datasets are 5 and 10. The remaining non-zero eigenvalues can be attributed to the small Gaussian noise added to the training data. For the original MNIST dataset, it has been shown that different digits have different intrinsic dimensions (Costa and Hero, 2006). This suggests the distribution of MNIST may lie on several disconnected manifolds with differing intrinsic dimensions. Although our model assumes that P_d lies on one continuous manifold, it is interesting to investigate the case when this model assumption is not fulfilled. We thus fit our model to the original MNIST data and plot the eigenvalues in Figure 6c. In contrast to Figures 6a and 6b, the gaps between eigenvalues are less pronounced, with no obvious step change. The values nonetheless suggest that the intrinsic dimension of MNIST is between 11 and 14. This result is consistent with previous estimates placing the intrinsic dimension of MNIST between 12 and 14 (Facco et al., 2017; Hein and Audibert, 2005; Costa and Hero, 2006). Recent work by Cornish et al. (2019) discusses fitting flow models to a P_d that lies on disconnected components by introducing a mixing prior. Such techniques may easily be combined with our method towards constructing more powerful flow models; a promising direction for future work.

8. RELATED WORK

Classic latent variable generative models assume that data distributions lie around a low-dimensional manifold, for example the Variational Auto-Encoder (Kingma and Welling, 2013) or the recently introduced Noisy Injective Flows (Cunningham et al., 2020). Such methods typically assume that the observational noise is not degenerate (e.g. a fixed Gaussian distribution); the model distribution is therefore absolutely continuous and maximum likelihood learning is well-defined. However, common data such as natural images usually do not have Gaussian observational noise (Zhao et al., 2017). Therefore, in this work, we focus on modeling distributions that lie on a low-dimensional manifold. The study of manifold learning for nonlinear dimensionality reduction (Cayton, 2005) and intrinsic dimension estimation (Camastra and Staiano, 2016) is a rich field with an extensive set of tools. However, most such methods do not model the data density on the manifold and are thus not used for the same purpose as the models introduced here. There are, however, a number of recent works introducing normalizing flows on manifolds that we now highlight and relate to our approach. Several works define flows on manifolds with prescribed charts. Gemici et al. (2016) generalized flows from Euclidean spaces to Riemannian manifolds by mapping points from the manifold M to R^K, applying a normalizing flow in this space, and then mapping back to M. The technique has been further extended to tori and spheres (Rezende et al., 2020). These methods require knowledge of the intrinsic dimension K and a parameterization of the coordinate chart of the data manifold. Without providing a chart mapping a priori, M-flow (Brehmer and Cranmer, 2020) recently proposed an algorithm that learns the chart mapping and the density simultaneously. However, their method still requires that the dimensionality of the manifold is known.
They propose that the dimensionality can be learned either by a brute-force search or through a trainable variance in the density function. The brute-force solution is clearly infeasible for data embedded in extremely high-dimensional spaces, as often encountered in deep learning tasks. Use of a trainable variance is natural and similar to our approach. However, as discussed at the beginning of this paper, without careful handling of the KL or MLE term in the objective, a vanishing variance parameter will result in wild behavior of the optimization process, since these terms are not well-defined. While the GIN model of Sorrenson et al. (2020) can recover the low-dimensional generating latent variables following their identifiability theorem, the assumptions therein require knowledge of an auxiliary variable, e.g. the label, which our model does not require. Behind this superficial difference lies an essential discrepancy between the concepts of informative dimensions and intrinsic dimensions. The GIN model discovers the latent variables that are informative in a given context, defined by the auxiliary variable u, rather than the true intrinsic dimensions. In their synthetic example, the ten-dimensional data is a nonlinear transformation of ten-dimensional latent variables, where two of the ten are correlated with the labels of the data and the other eight are not. In this example there are two informative dimensions, but ten intrinsic dimensions. Nevertheless, our method for intrinsic dimension discovery can be used together with informative-dimension discovery methods to uncover finer structures of data.

9. CONCLUSION

We presented a principled strategy to learn a data distribution that lies on a manifold and to identify its intrinsic dimension. We fixed the ill-defined KL divergence and showed, across multiple datasets, the resulting benefits for flow-based models in both sample generation and representation quality. There remain a number of open questions and interesting directions for future work, namely: further exploration of the effects of the entropy term in the case of non-volume-preserving networks, and investigation of the phenomenon concerning the apparent necessity of noise addition for complex real-world distributions, e.g. image data.

A MAXIMUM LIKELIHOOD ESTIMATION AND KL DIVERGENCE

Given data x_1, x_2, ..., x_N sampled independently from the true data distribution P_d, with density function p_d(x), we want to fit the model density p(x) (for simplicity, we use the notation p(x) to represent the model p_X(x) unless otherwise specified) to the data. A popular way to achieve this is to minimize the KL divergence

KL(p_d(x)||p(x)) = ∫ p_d(x) log p_d(x) dx − ∫ p_d(x) log p(x) dx   (16)
               = −∫ p_d(x) [log p_Z(g(x)) + log |det ∂g/∂x|] dx + const.

Since we can only access samples from p_d(x), we approximate the integral by Monte Carlo sampling:

KL(p_d(x)||p(x)) ≈ −(1/N) Σ_{n=1}^N log p(x_n) + const.

Therefore, minimizing the KL divergence between the data distribution and the model is (approximately) equivalent to Maximum Likelihood Estimation (MLE). When p(x) is a flow-based model with invertible flow function f: Z → X and g = f^{-1}, minimizing the KL divergence in X space is equivalent to minimizing the KL divergence in Z space. We let X_d be the random variable of the data distribution and define Z_Q = g(X_d) with density q(z), so q(z) can be represented as q(z) = ∫ δ(z − g(x)) p_d(x) dx. Let p(z) be the density of the prior distribution P_Z; the KL divergence in Z space can be written as

KL(q(z)||p(z)) = ∫ q(z) log q(z) dz [Term 1] − ∫ q(z) log p(z) dz [Term 2].

Term 1: using the properties of transformations of random variables (Papoulis and Pillai, 2002, pp. 660), the negative entropy can be written as

∫ q(z) log q(z) dz = ∫ p_d(x) log p_d(x) dx [const.] − ∫ p_d(x) log |det ∂g/∂x| dx.

Term 2: the cross-entropy can be written as

∫ q(z) log p(z) dz = ∫∫ δ(z − g(x)) p_d(x) log p(z) dz dx   (22)
                 = ∫ p_d(x) log p(g(x)) dx.

Therefore, the KL divergence in Z space is equivalent to the KL divergence in X space:

KL(q(z)||p(z)) = KL(p_d(x)||p(x)).

This establishes the connection between MLE and minimizing the KL divergence in Z space.

B ENTROPY

B.1 AN EXAMPLE

Assume a 2D Gaussian random variable X with covariance diag(1, 1), and let g be a volume-preserving flow with parameter θ. Let Z_1 = g_{θ1}(X) be a Gaussian with covariance diag(2, 1/2) and Z_2 = g_{θ2}(X) be a Gaussian with covariance diag(3, 1/3). Then the entropies satisfy H(Z_1) = H(Z_2) = H(X), and do not depend on θ. Let K be a Gaussian with zero mean and covariance diag(1, 1); then Z_1 + K is a Gaussian with covariance diag(3, 3/2) and Z_2 + K is a Gaussian with covariance diag(4, 4/3). Therefore H(Z_1 + K) ≠ H(Z_2 + K), and H(g_θ(X) + K) depends on θ. A similar example can be constructed when X is not absolutely continuous.

B.2 Z IS AN ABSOLUTELY CONTINUOUS RANDOM VARIABLE

For two mutually independent absolutely continuous random variables Z and K, the mutual information between Z + K and K is

I(Z + K, K) = H(Z + K) − H(Z + K|K)   (25)
            = H(Z + K) − H(Z|K)        (26)
            = H(Z + K) − H(Z).         (27)

The last equality holds because Z and K are independent. Since the mutual information I(Z + K, K) ≥ 0, we have

H(Z) ≤ H(Z + K) = H(Z) + I(Z + K, K).   (28)

Assume K has a Gaussian distribution with mean 0 and variance σ_Z². When σ_Z² = 0, K degenerates to a delta function, so Z + K = Z and I(Z + K, K) = H(Z + K) − H(Z) = 0; this holds because the mutual information between an a.c. random variable and a singular random variable is still well-defined, see (Yeung, 2008, Theorem 10.33). Assume K_1, K_2 are Gaussian random variables with mean 0 and variances σ_1² and σ_2² respectively. Without loss of generality, assume σ_1² > σ_2² with σ_1² = σ_2² + σ_δ², and let K_δ be a Gaussian random variable with mean 0 and variance σ_δ², so that K_1 = K_2 + K_δ. By the data-processing inequality,

I(Z + K_2, K_2) ≤ I(Z + K_2 + K_δ, K_2 + K_δ) = I(Z + K_1, K_1).   (30)

Therefore, I(Z + K, K) decreases monotonically as σ_Z² decreases, and as σ_Z² → 0, I(Z + K, K) → 0.

B.3 UPPER BOUND OF THE SPREAD KL DIVERGENCE

In this section, we show that leaving out the entropy term H(Z_Q) in equation 12 is equivalent to minimizing an upper bound of the spread KL divergence. For a singular random variable Z_Q = g(X_d) and an absolutely continuous random variable K that are independent, we have

H(Z_Q + K) − H(K) = H(Z_Q + K) − H(Z_Q + K | Z_Q) (31)
                  = I(Z_Q + K, Z_Q) ≥ 0. (32)

The first equality uses H(K) = H(K | Z_Q) = H(Z_Q + K | Z_Q) by independence; the second is the definition of mutual information (MI). The MI between an absolutely continuous random variable and a singular random variable is well defined and always positive, see (Yeung, 2008, Theorem 10.33) for a proof. Therefore, we can construct an upper bound of the spread KL objective in equation 12:

KL(q‖p) = ∫ q(z) log q(z) dz − ∫ q(z) log p(z) dz
        = −H(Z_Q + K) − ∫ q(z) log p(z) dz
        ≤ −H(K) − ∫ q(z) log p(z) dz,

where −H(K) is a constant. Therefore, ignoring the negative entropy term during training is equivalent to minimizing an upper bound of the spread KL objective.

B.4 EMPIRICAL EVIDENCE FOR IGNORING THE NEGATIVE ENTROPY

In this section, we first introduce the approximation technique used to compute the negative entropy term, and then discuss the contribution of this term to training. The negative entropy of the random variable Z_Q is

−H(Z_Q) = ∫ q(z) log q(z) dz, where q(z) = ∫ p_K(z − z') dQ_Z(z'),

and p_K is the density of a Gaussian with diagonal covariance σ²_Z I. We first approximate q(z) by a mixture of Gaussians

q(z) ≈ (1/N) Σ_{n=1}^{N} N(z; z_n, σ²_Z I) ≡ q̂_N(z),

where z_n is the n-th sample from the distribution Q_Z, obtained by first sampling x_n ∼ P_d and letting z_n = g(x_n). We denote the random variable of this Gaussian mixture as Ẑ^N_Q and approximate −H(Z_Q) ≈ −H(Ẑ^N_Q). However, the entropy of a Gaussian mixture distribution does not have a closed form, so we further apply a first-order Taylor expansion (Huber et al., 2008):

−H(Ẑ^N_Q) ≈ (1/N) Σ_{n=1}^{N} log q̂_N(z_n),

which is accurate for small σ²_Z. Finally, we have the approximation −H(Z_Q) ≈ (1/N) Σ_{n=1}^{N} log q̂_N(z_n).

To evaluate the contribution of the negative entropy, we train our flow model on both low-dimensional data (toy datasets: 2D) and high-dimensional data (fading square dataset: 1024D) with two objectives: (1) ignoring the negative entropy term in equation 7, and (2) approximating the negative entropy term in equation 7 as discussed above. During training, we track the value of the entropy H(Z_Q) (via the approximation) under both objectives, with N equal to the batch size. Additional training details remain consistent with those described in Appendix C.

In Figure 7, we plot the (approximated) entropy value H(Z_Q) during training for both experiments. We find the difference between including the (approximated) negative entropy term and ignoring it to be negligible. We leave a theoretical investigation of the effects of leaving out the entropy term during training to future work.
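The mixture-of-Gaussians entropy approximation above can be sketched in a few lines of numpy. This is a minimal illustration: the latent samples are drawn from a standard normal as a stand-in for z_n = g(x_n) from a trained flow, and the dimensions are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_gmm_density(z, centers, var):
    """log q_N(z) for an equal-weight Gaussian mixture with
    covariance var * I placed at the given centers."""
    d = centers.shape[1]
    sq = ((z[None, :] - centers) ** 2).sum(axis=1)
    log_comp = -0.5 * sq / var - 0.5 * d * np.log(2 * np.pi * var)
    return np.logaddexp.reduce(log_comp) - np.log(len(centers))

def neg_entropy_estimate(z_samples, var):
    """First-order Taylor estimate of -H(Z_Q):
    (1/N) * sum_n log q_N(z_n)  (Huber et al., 2008)."""
    return np.mean([log_gmm_density(z, z_samples, var) for z in z_samples])

# Hypothetical latent samples standing in for z_n = g(x_n).
z = rng.standard_normal((256, 2))
est = neg_entropy_estimate(z, var=0.01)
```

Note the estimate is just the mean log mixture density evaluated at the sample locations, which is why it is cheap to track during training with N equal to the batch size.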

C.3 S-CURVE DATASET

To fit our model to the data, we use the Adam optimizer with learning rate 5 × 10⁻⁴ and batch size 500, and train the model for 200k iterations. We anneal the learning rate by a factor of 0.9 every 10k iterations. We compare our method with a traditional normalizing flow with a fixed 3D Gaussian prior; both models share the same network architecture and training procedure. Figures 10a and 10b show the samples from our model and from the traditional flow with a fixed 3D Gaussian prior. More samples lie outside the true data distribution in Figure 10b than in Figure 10a, so our model achieves better sample quality on this S-curve dataset. We also compare the latent representations of both models, see Figures 10c and 10d. The representation distribution Q_Z of our model captures the structure of the data distribution well, whereas the distribution Q_Z in Figure 10d fails to do so.

C.4 FADING SQUARE DATASET

To fit the data, we train our model for 20k iterations with batch size 100 using the Adam optimizer. The learning rate is initialized to 5 × 10⁻⁴ and decays by a factor of 0.9 every 1k iterations. We additionally use L2 weight decay with factor 0.1.

C.5.1 IMPLICIT DATA GENERATION MODEL

To fit an implicit model to the MNIST dataset, we first train a Variational Auto-Encoder (VAE) (Kingma and Welling, 2013) with Gaussian prior p(z) = N(0, I). The encoder is q(z|x) = N(μ_θ(x), Σ_θ(x)), where Σ_θ is a diagonal matrix; both μ_θ and Σ_θ are parameterized by 3-layer neural networks. The hidden layers use ReLU activations, and a Sigmoid in the final layer constrains the output to lie between 0 and 1. The training objective is to maximize the lower bound of the log-likelihood

log p(x) ≥ ∫ q(z|x) log p(x|z) dz − KL(q(z|x) ‖ p(z)),

see Kingma and Welling (2013) for further details. We use the Adam optimizer with learning rate 1 × 10⁻⁴ and batch size 100 to train the model for 100 epochs. After training, we sample from the model by first drawing z ∼ p(z) and letting x = g_θ(z). This is equivalent to sampling from an implicit model p_θ(x) = ∫ δ(x − g_θ(z)) p(z) dz; see Tolstikhin et al. (2017) for further discussion of this implicit model construction. In Figure 11, we plot samples from the trained implicit model with dim(z) = 5 and dim(z) = 10, together with the original MNIST data.
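The KL term in the bound above has a closed form for a diagonal-Gaussian encoder and a standard-normal prior. As a small sanity check, this standard identity (not code from the paper) can be written as:

```python
import numpy as np

def kl_diag_gaussian_to_std(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ), the closed-form KL term in the
    VAE objective: 0.5 * sum(mu^2 + var - log(var) - 1)."""
    return 0.5 * np.sum(mu ** 2 + var - np.log(var) - 1.0)

# Matching the prior exactly gives zero KL; any mismatch is penalized.
assert np.isclose(kl_diag_gaussian_to_std(np.zeros(5), np.ones(5)), 0.0)
assert kl_diag_gaussian_to_std(np.ones(5), np.ones(5)) > 0
```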

C.5.2 FLOW MODEL TRAINING

We train our flow models to fit the synthetic MNIST datasets with intrinsic dimensions 5 and 10 and the original MNIST dataset. In all models, we use the Adam optimizer with learning rate 5 × 10⁻⁴ and batch size 100. We train the model for 3000k iterations and, following the initial 1000k iterations, decay the learning rate by a factor of 0.9 every 10k iterations. In Figure 12, we plot samples from our models trained on the three training datasets. In Figure 13, we also plot samples from traditional flow models trained on the same three datasets with the same experimental settings. We found the samples from our models to be sharper than those from the traditional flow models.



We use the generalized density to include the case that AA^T is not full rank.
The KL divergence KL(Q‖P) is well defined when Q and P have valid densities and their densities have the same support (Ali and Silvey, 1966).
The extension to higher dimensions is straightforward.



Figure 1: Samples and latent visualization from a flow based model with a fixed Gaussian prior when the intrinsic dimension is strictly lower than the true dimensionality of the data space.


Figure 3: (a) shows the samples from the data distribution P_d and our model P_X. (b) shows samples from the learned prior P_Z and the distribution Q_Z. (c) shows the eigenvalues of AA^T.

Figure 4: (a) S-curve data samples x ∼ P_d. (b) The latent representation z = g(x); points can be observed to lie on a linear subspace. (c) Eigenvalues of the matrix AA^T, from which we deduce that Indim(P_d) = 2. (d) Our representation after dimensionality reduction, z_proj = zE.

Figure 5: (a) and (b) show the samples from the data distribution and our model, respectively. (c) shows that a traditional flow-based model with a fixed Gaussian prior fails to generate any valid samples. (d) shows the first 20 eigenvalues of the matrix AA^T. (e) visualizes the representation after applying dimensionality reduction. See text for further discussion.

Figure 6: This figure shows the eigenvalues of AA^T after fitting (a) the synthetic MNIST dataset with intrinsic dimension 5; (b) the synthetic MNIST dataset with intrinsic dimension 10; (c) the original MNIST dataset.

Figure 8: (a) shows the samples from the data distribution. (b) and (d) show samples from a flow with a fixed Gaussian prior and from our method, respectively. (c) and (e) show the latent spaces of the two models. In (f), we plot the eigenvalues of the matrix AA^T.

Figure 9: (a) and (c) show the ground-truth 'density allocation' on the manifold for two toy datasets; (b) and (d) show the 'density allocation' learned by our models.

Figure 10: Figures (a) and (b) plot samples from the data distribution P_d of the S-curve dataset together with samples from our model and from a traditional flow with a fixed Gaussian prior, respectively. Figure (c) shows the representation distribution Q_Z and the learned prior P_Z of our model. In Figure (d), we plot the representation distribution Q_Z of the flow with a fixed Gaussian prior.

Figure 11: Training data for the flow model. Figures (a) and (b) are synthetic MNIST samples from two implicit models with latent dimensions 5 and 10. Figure (c) shows samples from the original MNIST dataset.

Figure 12: Samples from our method. Figures (a) and (b) are samples from flow models trained on synthetic MNIST data with intrinsic dimensions 5 and 10. Figure (c) shows samples from a flow model trained on the original MNIST data.

Figure 13: Samples from traditional non-volume preserving flow models with fixed Gaussian prior.

C EXPERIMENTS

C.1 NETWORK ARCHITECTURE

The flow network we use consists of incompressible affine coupling layers (Sorrenson et al., 2020; Dinh et al., 2016). Each coupling layer splits a D-dimensional input x into two parts x_{1:d} and x_{d+1:D}. The output of the coupling layer is

y_{1:d} = x_{1:d},  y_{d+1:D} = x_{d+1:D} ⊙ exp(s(x_{1:d})) + t(x_{1:d}),

where s : R^d → R^{D−d} and t : R^d → R^{D−d} are scale and translation functions parameterized by neural networks, and ⊙ is the element-wise product. The log-determinant of the Jacobian of a coupling layer is the sum of the scale outputs, Σ_j s(x_{1:d})_j. To make the coupling transform volume preserving, we normalize the output of the scale function, so the i-th dimension of the output is

ŝ(x_{1:d})_i = s(x_{1:d})_i − (1/(D−d)) Σ_j s(x_{1:d})_j,

and the log-determinant of the Jacobian is Σ_i ŝ(x_{1:d})_i = 0. We compare a volume-preserving flow with a learnable prior (normalized s(·)) to a non-volume-preserving flow with a fixed prior (unnormalized s(·)). In this way both models have the ability to adapt their 'volume' to fit the target distribution, retaining comparison fairness.

In our affine coupling layers, the scale function s and the translation function t have two types of structure: fully connected networks and convolutional networks. Each fully connected network is a 4-layer neural network with hidden size 24 and Leaky ReLU activations with slope 0.2. Each convolutional network is a 3-layer convolutional neural network with hidden channel size 16, kernel size 3×3, and padding 1; the activation is Leaky ReLU with slope 0.2. The downsampling decreases the image width and height by a factor of 2 and increases the number of channels by 4 in a checkerboard-like fashion (Sorrenson et al., 2020; Jacobsen et al., 2018). When multiple convolutional networks are connected together, we randomly permute the channels of the output of each network except the final one.
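The incompressible coupling transform above can be sketched in numpy. This is a minimal illustration with random stand-in weights and a toy two-layer scale/translation network (the real networks are 4-layer MLPs or CNNs); it checks the two defining properties: the first block passes through unchanged, and normalizing s(·) makes the log-determinant exactly zero.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(x, w1, w2):
    """Stand-in for the scale/translation networks (hypothetical weights),
    using Leaky ReLU with slope 0.2 as in the paper."""
    h = x @ w1
    return np.maximum(h, 0.2 * h) @ w2

def coupling_forward(x, d, params, volume_preserving=True):
    """One affine coupling layer:
    y_{1:d} = x_{1:d};  y_{d+1:D} = x_{d+1:D} * exp(s(x_{1:d})) + t(x_{1:d})."""
    ws1, ws2, wt1, wt2 = params
    x1, x2 = x[:, :d], x[:, d:]
    s = mlp(x1, ws1, ws2)
    if volume_preserving:
        s = s - s.mean(axis=1, keepdims=True)  # sum_i s_i = 0 => logdet = 0
    t = mlp(x1, wt1, wt2)
    y = np.concatenate([x1, x2 * np.exp(s) + t], axis=1)
    logdet = s.sum(axis=1)  # log|det J| of the coupling transform
    return y, logdet

D, d, h = 4, 2, 24
params = tuple(rng.standard_normal(shape) * 0.1
               for shape in [(d, h), (h, D - d), (d, h), (h, D - d)])
x = rng.standard_normal((8, D))
y, logdet = coupling_forward(x, d, params)
assert np.allclose(logdet, 0.0)         # incompressible (volume preserving)
assert np.allclose(y[:, :d], x[:, :d])  # identity on the first block
```

The unnormalized variant (volume_preserving=False) recovers the standard non-volume-preserving coupling layer used as the fixed-prior baseline.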

C.2 TOY DATASET

We also construct a second dataset and train a flow model using the same training procedure discussed in Section 7.1. Figure 8a shows samples from the data distribution P_d: each data point is a 2D vector x = [x_1, x_2] where x_1 ∼ N(0, 1) and x_2 = x_1, so Indim(P_d) = 1. Figure 8f shows that the prior P_Z has learned the true intrinsic dimension, Indim(P_Z) = Indim(P_d) = 1. We compare to samples drawn from a flow model that uses a fixed 2D Gaussian prior, with results shown in Figure 8. For this simple dataset, the flow with a fixed Gaussian prior can generate reasonable samples, but the 'curling up' behavior discussed in the main paper remains highly evident in the Z space, see Figure 8c.

We also plot the 'density allocation' on the manifold for the two toy datasets. For example, for the data generation process x = [x_1, x_2] with x_1 ∼ p = N(0, 1) and x_2 = x_1, we use the density value p(x = x_1) to indicate the 'density allocation' on the 1D manifold. To plot the 'density allocation' of our learned model, we first sample x_s uniformly from the support of the data distribution; the subscript 's' indicates that these samples only carry information about the support. Specifically, since x_s = [x_s1, x_s2], we sample x_s1 ∼ U(−3σ, 3σ) (σ is the standard deviation of N(0, 1) and U is the uniform distribution) and let x_s2 = x_s1 or x_s2 = sin(x_s1), depending on which dataset is used. We use the projection procedure described in Section 7.2 to obtain the projected samples z_proj_s ∈ R^{Indim(P_d)}. We also project the learned prior P_Z to R^{Indim(P_d)} by constructing P_proj_Z as a Gaussian with zero mean and a diagonal covariance containing the non-zero eigenvalues of AA^T. Therefore P_proj_Z is absolutely continuous in R^{Indim(P_d)} and we denote its density function by p_proj(z). We then use the density value p_proj(z = z_proj_s) to indicate the 'density allocation' at the location of x_s on the manifold support.
In Figure 9 , we compare our model with the ground truth and find that we can successfully capture the 'density allocation' on the manifold.
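The dimension-identification step used throughout these experiments (reading Indim off the eigenvalues of AA^T) can be sketched as follows. This is an illustration only: the latent samples mimic the toy dataset x_2 = x_1, and the sample covariance is used as a stand-in for the learned AA^T, which is an assumption rather than the paper's exact training procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical latent samples: data on a 1D manifold embedded in 2D,
# mimicking the toy dataset x = [x1, x1] with x1 ~ N(0, 1).
x1 = rng.standard_normal(1000)
z = np.stack([x1, x1], axis=1)  # rank-1 latent cloud

# Stand-in for AA^T: the sample covariance of the latent representation.
aat = np.cov(z, rowvar=False)
eigvals = np.linalg.eigvalsh(aat)

# Count eigenvalues above a small relative threshold => intrinsic dimension.
indim = int((eigvals > 1e-6 * eigvals.max()).sum())
assert indim == 1  # matches Indim(P_d) = 1 for this toy construction
```

In practice the eigenvalue spectrum shows a sharp drop at the intrinsic dimension (Figures 3c, 4c, 5d, 6, 8f), so a simple threshold suffices.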

