DENSITY ESTIMATION ON LOW-DIMENSIONAL MANIFOLDS: AN INFLATION-DEFLATION APPROACH

Abstract

Normalizing Flows (NFs) are universal density estimators based on Neural Networks. However, this universality is limited: the density's support needs to be diffeomorphic to a Euclidean space. In this paper, we propose a novel method to overcome this limitation without sacrificing universality. The proposed method inflates the data manifold by adding noise in the normal space, trains an NF on this inflated manifold, and, finally, deflates the learned density. Our main result provides sufficient conditions on the manifold and the specific choice of noise under which the corresponding estimator is exact. Our method has the same computational complexity as NFs and does not require computing an inverse flow. We also show that, if the embedding dimension is much larger than the manifold dimension, noise in the normal space can be well approximated by Gaussian noise. This allows using our method for approximating arbitrary densities on non-flat manifolds, provided that the manifold dimension is known.

1. INTRODUCTION

Many modern problems involving high-dimensional data are formulated probabilistically. Key concepts such as Bayesian classification, denoising, or anomaly detection rely on the data-generating density p*(x). Learning this density from samples is therefore a research area of crucial importance. For the case where the corresponding random variable X ∈ R^D takes values on a manifold diffeomorphic to R^D, a Normalizing Flow (NF) can be used to learn p*(x) exactly (Huang et al., 2018). Recently, a few attempts have been made to overcome this topological constraint. However, to do so, all of these methods either need to know the manifold beforehand (Gemici et al. (2016), Rezende et al. (2020)) or they sacrifice the exactness of the estimate (Cornish et al. (2019), Dupont et al. (2019)). Our goal in this paper is to overcome both of these limitations of using NFs for density estimation on Riemannian manifolds. Given data points from a d-dimensional Riemannian manifold embedded in R^D, d < D, we first inflate the manifold by adding a specific noise in the normal space of the manifold, then train an NF on this inflated manifold, and, finally, deflate the trained density by exploiting the choice of noise and the geometry of the manifold. See Figure 1 for a schematic overview of these steps. Our main theorem states sufficient conditions on the manifold and the type of noise used for the inflation step such that the deflation becomes exact. To guarantee exactness, we do need to know the manifold, as in e.g. Rezende et al. (2020), because we need to be able to sample in the manifold's normal space. However, as we will show, for the special case where D ≫ d, the usual Gaussian noise is an excellent approximation for noise supported in the normal space. This allows using our method for approximating arbitrary densities on Riemannian manifolds provided that the manifold dimension is known.
In addition, our method is based on a single NF without the necessity to invert it. Hence, we do not add any additional complexity to the usual training procedure of NFs.

Notation: We denote the determinant of the Gram matrix of f as g_f(x) := |det J_f(x)^T J_f(x)| where J_f(x) is the Jacobian of f. We denote the Lebesgue measure in R^n as λ_n. Random variables are denoted with a capital letter, say X, and the corresponding state space with its calligraphic version, i.e. X. Small letters correspond to vectors with dimensionality given by context. The letters d, D, n, and N are always natural numbers.

Figure 1: Schematic overview of our method. 1. A density p*(x) with support on a d-dimensional manifold X (top left) is inflated by adding noise of magnitude σ² in the normal space (top right). 2. An NF F_θ^{-1} learns this inflated density q(x̃) using a well-known reference measure p_U(u). 3. We deflate the learned density to obtain an estimate p(x) for p*(x). 4. Our main result (Theorem 1) provides sufficient conditions on the manifold X and the choice of noise such that p(x) = p*(x).

2. BACKGROUND AND PROBLEM STATEMENT

An NF transforms a known auxiliary random variable by using bijective mappings parametrized by Neural Networks such that the given data points are samples from this transformed random variable, see Papamakarios et al. (2019). Formally, an NF is a diffeomorphism F_θ : U → X and induces a density on X through p_θ(x) = g_{F_θ}(u)^{-1/2} p_U(u) where p_U(u) is known and u = F_θ^{-1}(x). The parameters θ are updated such that the KL-divergence between p*(x) and p_θ(x),

D_KL(p*(x) || p_θ(x)) = -E_{x∼p*(x)}[log p_θ(x)] + const, (1)

is minimized. If F_θ is expressive enough, it was proven that, in the limit of infinitely many samples, updating θ to minimize this objective converges to a θ* such that p*(x) = p_θ*(x) holds P_X-almost surely, see (Huang et al., 2018). More generally, let X ∈ X ⊂ R^D be generated by an unobserved random variable Z ∈ Z ⊂ R^d with density π(z), that is X = f(Z) for some function f : Z → X where typically d < D. In Gemici et al. (2016), f is an embedding¹, and it was shown that one can calculate probabilities such as P_X(A) for measurable A ⊂ X using a density p*(x) with respect to the volume form dV_f induced by f, that is

P(X ∈ A) = ∫_{f^{-1}(A)} π(z) dz = ∫_A p*(x) dV_f(x) (2)

with p*(x) = π(z) g_f(z)^{-1/2} and dV_f(x) = g_f(z)^{1/2} dz where z = f^{-1}(x). Hence, given an explicit mapping f and samples from p*(x), we can learn the unknown density π(z) using a usual NF in R^d. However, in general, the generating function f is either unknown or not an embedding, creating numerical instabilities for training inputs close to singularity points. In Brehmer & Cranmer (2020), f and the unknown density π are learned simultaneously. The main idea is to define f as a level set of a usual flow in R^D and train it together with the flow in R^d used to learn π(z). To evaluate the density, one needs to invert f, and thus this approach may be very slow for high-dimensional data.
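The change-of-variables formula p_θ(x) = g_{F_θ}(u)^{-1/2} p_U(u) above can be checked numerically. The following sketch uses a fixed 1D affine map as a stand-in for a trained flow (an illustration only, not the paper's architecture):

```python
import numpy as np

# Numeric check of the change-of-variables formula
# p_theta(x) = g_F(u)^{-1/2} p_U(u) with u = F^{-1}(x).
# F(u) = a*u + b is a hypothetical stand-in for a trained flow.
a, b = 2.0, 1.0

def F_inv(x):
    return (x - b) / a

def p_U(u):
    # standard normal reference density
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def p_theta(x):
    u = F_inv(x)
    g_F = a**2  # det(J_F^T J_F) for the scalar Jacobian J_F = a
    return g_F**-0.5 * p_U(u)

# The push-forward of N(0, 1) through F is N(b, a^2);
# the two densities must agree pointwise.
x = 1.7
analytic = np.exp(-0.5 * ((x - b) / a)**2) / np.sqrt(2.0 * np.pi * a**2)
assert abs(p_theta(x) - analytic) < 1e-12
```

The same identity is what the NF optimizes in equation (1), with g_{F_θ} evaluated by the flow architecture instead of in closed form.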
Besides, to guarantee that f learns the manifold, they proposed several ad hoc training strategies. We tie in with the idea of using an NF for learning p*(x) with unknown f and study the following problem.

Problem 1 Let X be a d-dimensional manifold embedded in R^D. Let X = f(Z) be a random variable generated by an embedding f : R^d → R^D and a random variable Z ∼ π(z) in R^d. Given N samples from p*(x) as described above, find an estimator p̂ of p* such that, in the limit of infinitely many samples, p̂(x) = p*(x) holds P_X-almost surely.

3. METHODS

To solve Problem 1, we want to exploit the universality of NFs. We inflate X such that the inflated manifold X̃ becomes diffeomorphic to a set U on which a simple density exists. Doing so allows us to learn the inflated density exactly using a single NF, see Section 2. Then, given such an estimator for the modified density, we approximate p*(x) and give sufficient conditions under which this approximation is exact.

3.1. THE INFLATION STEP

Given a sample x of X, if we add some noise ε ∈ R^D to it, the resulting new random variable X̃ = X + ε has the density

q(x̃) = ∫_X q(x̃|x) dP_X(x). (3)

Denote the tangent space at x as T_x and the normal space as N_x. By definition, N_x is the orthogonal complement of T_x. Therefore, we can decompose the noise ε into its tangent and normal components, ε = ε_t + ε_n. In the following, we consider noise in the normal space only, i.e. ε_t = 0, and denote the density of the resulting random variable as q_n(x̃). The corresponding noise density q_n(x̃|x) has mean x and domain N_x. We denote the support of q_n(·|x) by N_{q_n(·|x)}. The random variable X̃ = X + ε_n is now defined on X̃ = ∪_{x∈X} N_{q_n(·|x)}. We want X̃ to be diffeomorphic to a set U on which a known density can be defined.

Example 1 (a) Let X = S¹ = {x ∈ R² | ||x|| = 1} be the unit circle. For each x ∈ S¹ there exists z ∈ [0, 2π) such that x = e_r(x) = (cos(z), sin(z))^T. To sample a point x̃ in N_x, which is spanned by e_r(x), we sample a scalar value γ and set x̃ = x + γ e_r(x). With γ ∼ Uniform[-1, 1), we have that

X̃ = ∪_{x∈X} {x + γ e_r(x) | γ ∈ [-1, 1)} = {x̃ ∈ R² | ||x̃||_2 < 2} (4)

which is the open disk with radius 2. The open disk is diffeomorphic to (0, 1) × (0, 1). Thus, q_n(x̃) can be learned by a single NF F^{-1} with p_U(u) = Uniform((0, 1) × (0, 1)) as reference.

(b) As in (a), we consider the unit circle. Now we let γ follow a (shifted) χ²-distribution with support [-1, ∞). Then

X̃ = ∪_{x∈X} {x + γ e_r(x) | γ ∈ [-1, ∞)} = R². (5)

Thus, q_n(x̃) can be learned by a single NF F^{-1} with p_U(u) = N(u; 0, I_D) as reference. Both cases can be analogously extended to higher dimensions.

Remark 1 To be precise, the random variable ε_n is generated by a random variable in R^{D-d}, say Γ, with measure P_Γ. Then, q_n(x̃|x) is the density of the pushforward of the noise measure P_Γ with respect to the mapping h : R^{D-d} → N_x.
Hence, formally, the density q_n(x̃|x) is a density with respect to the induced volume form dV_h, see Section 2. However, if we choose an orthonormal basis for N_x, say n^(1), ..., n^(D-d), then we have that x̃ = h(γ) = Aγ + x where the columns of A ∈ R^{D×(D-d)} are given by these basis vectors, i.e. A = [n^(1), ..., n^(D-d)]. Thus, the Gram determinant of h is g_h = det(A^T A) = 1 and we have that dV_h(x̃) = dγ where dγ denotes the volume form with respect to λ_{D-d}. In this case, we can think of q_n(x̃|x) as a density with respect to λ_{D-d}. If q_n(x̃|x) depends only on ||x̃ - x||, as is the case for the Gaussian distribution, we have that q_n(x̃|x) = q_n(||x̃ - x||) = q_n(||γ||) because h is an isometry. Thus, for this case it holds that q_n(x̃|x) dV_h(x̃) = q_n(||γ||) dγ. Then, for convenience, we may abuse notation by writing γ ∼ q(x̃|x) or ε_n ∼ r(γ) where r(γ) is the density of P_Γ with respect to λ_{D-d}.
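Example 1(a) can be sketched numerically: inflating the unit circle with uniform radial (normal-space) noise indeed fills the open disk of radius 2 (a minimal illustration, not the paper's experiment code):

```python
import numpy as np

# Example 1(a): inflate the unit circle S^1 with radial noise
# gamma ~ Uniform[-1, 1). The inflated support should be the
# open disk of radius 2.
rng = np.random.default_rng(0)
z = rng.uniform(0, 2 * np.pi, 100_000)
x = np.stack([np.cos(z), np.sin(z)], axis=1)   # points on S^1
e_r = x                                        # unit normal direction at x
gamma = rng.uniform(-1.0, 1.0, size=(len(z), 1))
x_tilde = x + gamma * e_r                      # inflated samples

radii = np.linalg.norm(x_tilde, axis=1)        # |x_tilde| = 1 + gamma
assert radii.max() < 2.0 and radii.min() >= 0.0
```

For Example 1(b), replacing `gamma` by a χ²-sample shifted to [-1, ∞) fills all of R² instead of the disk.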

3.2. THE DEFLATION STEP

Our main idea is to find conditions such that

q_n(x̃) = q_n(x̃|x) p*(x) (6)

for almost all x̃ ∈ X̃ and for an almost surely unique x ∈ X. Because then, given an exact estimator of q_n(x̃), say q̂_n(x̃), we have for x̃ = x that p*(x) = q̂_n(x)/q_n(x|x). For equation (6) to be true, we need to guarantee that almost every x̃ corresponds to only one x ∈ X. This is certainly the case whenever the normal spaces have no intersections at all (think of a simple line in R²). We can relax this assumption by allowing null-set intersections. Moreover, only those subsets of the normal spaces are of interest which are generated by the specific choice of noise q_n(x̃|x). Thus, only the support of q_n(x̃|x), N_{q_n(·|x)}, matters. The key concept for our main result is expressed in the following definition:

Definition 1 Let X be a d-dimensional manifold and N_x the normal space at x ∈ X. Let q_n(·|x) be a density defined on N_x and denote by N_{q_n(·|x)} the support of q_n(·|x). Denote the collection of all such densities as Q := {q_n(·|x)}_{x∈X}. For x̃ ∈ X̃, we define the set of all possible generators of x̃ as A(x̃) = {x ∈ X | N_{q_n(·|x)} ∋ x̃}. We say X is Q-normally separated if for all x ∈ X it holds that P_{X̃|X=x}[x̃ ∈ N_x | #A(x̃) > 1] = 0 where #A(x̃) is the cardinality of the set A(x̃). In words, every x̃ ∈ N_x is P_{X̃|X=x}-almost surely determined by x. To familiarize with this concept, consider Figure 2 and the following example:

Example 2 For the circle in Example 1, we choose ε_n to be uniformly distributed on the half-open interval [-1, 1). The point (0, 0)^T is contained in N_{q_n(·|x)} for all x ∈ X, and thus N_{q_n(·|x')} ∩ N_{q_n(·|x)} = {(0, 0)^T} for all x ≠ x', see Figure 2 (middle). Hence, for any given x̃ ∈ N_x we have that A(x̃) = X if x̃ = (0, 0)^T and A(x̃) = {x} otherwise, and therefore #A(x̃) = ∞ if x̃ = (0, 0)^T and 1 otherwise. Thus, P_{X̃|X=x}[x̃ ∈ X̃ | #A(x̃) > 1] = P_{X̃|X=x}[x̃ = (0, 0)^T] = 0 for all x ∈ X. It follows that X is Q-normally separated.
If we were to choose ε_n to be uniformly distributed on [-1.5, 1), see Figure 2 (right), the normal spaces would overlap and we would have that P_{X̃|X=x}[x̃ ∈ X̃ | #A(x̃) > 1] > 0. In this case, X would not be Q-normally separated.

Figure 2: Q-normal separability for different noise distributions q_n(x̃|x) used to inflate X = S¹ (black line). Left (ε_n ∼ Uniform[-0.5, 1)): X is Q-normally separated since every point in the inflated space X̃ (red shaded area) has a unique generator. Middle (ε_n ∼ Uniform[-1, 1)): X is Q-normally separated since P_X̃-almost every point in X̃ has a unique generator. Right (ε_n ∼ Uniform[-1.5, 1)): X is not Q-normally separated since every point in the dark shaded area has two generators.

Theorem 1 Let X be a d-dimensional manifold. For each x ∈ X, let q_n(·|x) denote a distribution on the normal space of x. Let X be Q-normally separated where Q := {q_n(·|x)}_{x∈X}. Assume that we can learn the density q_n(x̃) as defined in equation (3) by using a single NF F^{-1}, thus q_n(x̃) = (g_F(F^{-1}(x̃)))^{-1/2} p_U(F^{-1}(x̃)) for some known density p_U. Then, for P_X̃-almost all x̃ ∈ X̃ it holds that q_n(x̃) = p*(x) q_n(x̃|x); evaluated at x̃ = x, this yields

p*(x) = q_n(x) / q_n(x|x). (7)

The proof can be found in Appendix A.1.
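The deflation formula (7) can be sketched on the simplest possible manifold, the x-axis embedded in R², where the normal spaces never intersect and Q-normal separation holds trivially. We pretend the NF learned q_n exactly by using its analytic form (a sketch under that assumption, not the trained estimator):

```python
import numpy as np

# Deflation p*(x) = q_n(x)/q_n(x|x) on the x-axis in R^2.
# Manifold density: standard normal along the line.
# Noise: Gaussian in the normal (y) direction with std sigma.
sigma = 0.3

def p_star(x):
    # density on the 1D manifold
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def q_n(x, y):
    # inflated density in R^2: q_n(x_tilde) = p*(x) * noise density
    normal = np.exp(-0.5 * (y / sigma)**2) / np.sqrt(2 * np.pi * sigma**2)
    return p_star(x) * normal

x = 0.8
q_cond = 1.0 / np.sqrt(2 * np.pi * sigma**2)  # q_n(x|x): noise density at its mean
deflated = q_n(x, 0.0) / q_cond               # evaluate at x_tilde = x and deflate
assert abs(deflated - p_star(x)) < 1e-12
```

In practice q_n is replaced by the NF estimate, and Theorem 1 guarantees the same identity holds P_X̃-almost surely under Q-normal separation.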

3.3. GAUSSIAN NOISE AS NORMAL NOISE AND THE CHOICE OF σ²

Our proposed method depends on three critical points. First, we need to be able to sample in the normal space of X. Second, we need to determine the magnitude and type of noise. Third, we need to make sure that the conditions of Theorem 1 are fulfilled. We address (partially) these three points.

1. For the special case where D ≫ d, we show that full Gaussian noise is an excellent approximation for a Gaussian noise restricted to the normal space. Consider ε = ε_t + ε_n, ε ∼ N(0, σ² I_D). Then, the expected absolute squared error when approximating normal noise with full Gaussian noise is E[|ε - ε_n|²] = E[|ε_t|²] = dσ². The expected relative squared error is therefore

E[|ε - ε_n|²/|ε_n|²] = E[|ε_t|²/|ε_n|²] = dσ² E[1/|ε_n|²] = d/(D - d - 2) (8)

because ε_t and ε_n are independent and 1/|ε_n|² follows a scaled inverse χ²-distribution with D - d degrees of freedom and scale parameter 1/σ², so that E[1/|ε_n|²] = 1/(σ²(D - d - 2)). Thus, if D ≫ d, Gaussian noise is an excellent approximation for a Gaussian in the normal space. We denote the inflated density with Gaussian noise by q_{σ²}(x̃) in the following.

2. The inflation must not garble the manifold too much. For instance, adding Gaussian noise with magnitude σ ≥ r to a circle with radius r will blur the circle. Since the curvature of the circle is 1/r, intuitively, we want σ to scale inversely with the second derivative of the generating function f. Additionally, we do not want to lose the information of p*(x) by inflating the manifold. If the generating distribution π(z) makes a sharp transition at z₀, i.e. π(z₀ - Δz) ≫ π(z₀ + Δz) for |Δz| ≪ 1, adding too much noise at x₀ = f(z₀) will smooth out that transition. Hence, we want σ to scale inversely with π''(z). We formalize these intuitions in Proposition 1 and prove them in Appendix A.2. In accordance with Theorem 1, we say p_σ(x) approximates p*(x) well if lim_{σ²→0} p_σ(x)/q_n(x|x) = p*(x) for all x ∈ X, where q_n(x|x) is the normalization constant of a (D-d)-dimensional Gaussian distribution.
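The relative error d/(D - d - 2) in equation (8) is easy to verify by Monte Carlo, splitting a full Gaussian into its first d (tangent) and last D - d (normal) coordinates (a sketch with a fixed, axis-aligned decomposition):

```python
import numpy as np

# Monte Carlo check of equation (8):
# E[|eps_t|^2 / |eps_n|^2] = d / (D - d - 2) for eps ~ N(0, sigma^2 I_D),
# with the tangent space taken as the first d coordinates.
rng = np.random.default_rng(1)
d, D, sigma = 2, 20, 0.5
eps = rng.normal(0.0, sigma, size=(500_000, D))
ratio = (eps[:, :d]**2).sum(axis=1) / (eps[:, d:]**2).sum(axis=1)
mc = ratio.mean()
exact = d / (D - d - 2)   # = 0.125 for d=2, D=20
assert abs(mc - exact) < 0.01
```

Note the result is independent of σ, as the scale cancels in the ratio; only the gap between D and d matters.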
Proposition 1 Let X ∈ R^D be generated by Z ∼ π(z) through an embedding f : R^d → R^D, i.e. f(Z) = X. Let π ∈ C²(R^d). For q_{σ²}(x) to approximate p*(x) well, in the sense that lim_{σ²→0} q_{σ²}(x)/q_n(x|x) = p*(x) for x ∈ X, a necessary condition is that

σ²/(2π(z₀)) ||π''(z₀) ∘ (J_f^T J_f)^{-1}||_+ ≪ 1

where ||A||_+ = Σ_{i,j=1}^d A_ij for A ∈ R^{d×d}, ∘ denotes the elementwise product, and (π''(z₀))_ij = ∂²π(z)/∂z_i∂z_j |_{z=z₀} is the Hessian of the prior evaluated at z₀ = f^{-1}(x).

Intuitively, a second necessary condition is that the noise magnitude should be much smaller than the radius of curvature of the manifold, which directly depends on the second-order derivatives of f. This can be illustrated in the following example:

Example 3 For the circle² with radius r in R² generated by f(z) = r(cos(z), sin(z))^T and a von Mises distribution π(z) ∝ exp(κ cos(z)), we get that σ² ≪ min( 2r²/(κ(κ sin²(z) - cos(z))), r² ), where the first condition comes from Proposition 1 and the second one comes from the curvature argument.

Even though this bound may not be useful as such in practice when f and π are unknown, it can still be used if f and π are estimated locally with nearest-neighbor statistics. From a numerical perspective, inflating a manifold using Gaussian noise circumvents degeneracy problems when training a vanilla NF on low-dimensional manifolds. In particular, the flow's Jacobian determinant becomes numerically unstable, see equation (1). This determinant is essentially a volume-changing factor for balls. From a sampling perspective, these volumes can be estimated by the number of samples falling into the ball divided by the total number of points. Therefore, we suggest lower-bounding σ with the average nearest-neighbor distance obtained from the training set to make sure that these volumes are not empty, thus avoiding numerical instabilities. 3.
Intuitively, if the curvature of the manifold is not too high and if the manifold is not too entangled, Q-normal separability is satisfied for a sufficiently small magnitude of noise. In the manifold learning literature, too, the entanglement must not be too strong. Informally, the reach number provides a necessary condition on the manifold such that it is learnable through samples, see Chapter 2.3 in Berenfeld & Hoffmann (2019). Formally, the reach number is the maximum distance τ_X such that for all x̃ in a τ_X-neighbourhood of X the projection onto X is unique. In Appendix A.3 we prove Theorem 2, which states that any closed manifold X with τ_X > 0 is Q-normally separated.

Theorem 2 Let X ⊂ R^D be a closed d-dimensional manifold. If X has a positive reach number τ_X, then X is Q-normally separated where Q := {q_n(·|x)}_{x∈X} is the collection of uniform distributions on a ball with radius τ_X, i.e. q_n(x̃|x) = Uniform(x̃; B(x, τ_X) ∩ N_x) where B(x, τ_X) denotes a D-dimensional ball with radius τ_X and center x.
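The nearest-neighbor lower bound for σ suggested in point 2 above can be sketched directly from a training set. For illustration we use points roughly uniform on a circle of radius 3 (as in Section 4.1); with N points, the average nearest-neighbor distance is close to the half mean arc-gap πr/N:

```python
import numpy as np

# Average nearest-neighbour distance of a training set, suggested
# above as a numerical lower bound for sigma. Illustration: points
# on a circle of radius 3 with uniform angles (not the von Mises
# training set of the experiments).
rng = np.random.default_rng(2)
N, r = 2000, 3.0
z = rng.uniform(0, 2 * np.pi, N)
pts = r * np.stack([np.cos(z), np.sin(z)], axis=1)

# full pairwise distance matrix (fine for a few thousand points)
dists = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
np.fill_diagonal(dists, np.inf)
avg_nn = dists.min(axis=1).mean()   # suggested lower bound for sigma
assert 0 < avg_nn < 3 * 2 * np.pi * r / N
```

For larger training sets a k-d tree query would replace the quadratic pairwise computation, but the estimate itself is the same.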

4. RESULTS

We have four goals in this section: First, we numerically confirm the scaling factor in equation (7). Second, we verify that Gaussian noise can be used to approximate a Gaussian restricted to the normal space. Third, we numerically test our bound for σ² derived in Section 3.3. Finally, we show that we can learn complicated distributions on S² without using explicit charts. For training details, we refer to Appendix B.1 and B.2, respectively.

4.1. VON MISES ON A CIRCLE

Let X be a circle with radius 3 and let π(z) ∝ exp(8 cos(z)) be a 1D von Mises distribution. Given z ∼ π(z), we generate a point in X according to the mapping f(z) = 3(cos(z), sin(z)). We want to learn the induced density p*(x). Note that 3p*(x) = π(z) since 1/3 is the volume-form factor induced by f. To benchmark our performance, we use the idea in Gemici et al. (2016) to first embed the circle into R, using e.g. f^{-1}, learn the density there with an NF, and transform this learned density back to S¹. In Brehmer & Cranmer (2020), this method is named Flow on manifolds (FOM) and we stick to this notation in the following. Note that f is not injective and, to illustrate the benefit of our method, we choose the singularity point to be (1, 0)^T. By moving points close to (1, 0)^T slightly away from it, we numerically ensure that f is an embedding.

1. The inflation step: We inflate X using 3 types of noise: Gaussian in the normal space (NG), Gaussian in the full ambient space (FG), and χ²-noise as described in Example 1(b) with scale parameter 3. Technically, NG violates the Q-normal separability assumption. However, if σ² is small and the scale parameter of the von Mises distribution is large enough, it is practically fulfilled.

2. Learning q_n(x̃): For the NFs we use a Block Neural Autoregressive Flow (BNAF) (De Cao et al., 2019). We use the same NF architecture and training procedures across all models.

3. Deflation: Given an estimator for q_n(x̃), we use equation (7) to calculate p*(x). For FG and NG, we have that q_n(x|x) = 1/√(2πσ²), and for the normal χ²-noise, q_n(x|x) = √3 e^{-3/2}/(√8 Γ(3/2)).

In Figure 3, we show the results for σ² = 0.01 and σ² = 1. In the respective plot, the first row shows training samples from the inflated distributions q_{σ²}(x̃) (left) and q_n(x̃) (middle), respectively. We color-code a sample x̃ = x + ε according to p*(x) to illustrate the impact of noise on the inflated density.
Note that the FOM model (top right) does not need any inflation and is therefore trained on samples from p*(x) only. The second row of the respective plot shows the learned density for the different models and compares it to the ground-truth von Mises distribution π(z) depicted in black. As we can see, for σ² = 0.01 all models perform very well, although the FOM model slightly fails to capture π(z) for z close to 0, which corresponds to the chosen singularity point. For σ² = 1, we see a significant drop in the performance of the Gaussian model. Although the manifold is significantly disturbed, the normal-noise model still learns the density almost perfectly, and so does the normal χ²-noise model, as predicted by Theorem 1. To measure the dependence of our method on the magnitude of noise, we iterate this experiment for various values of σ² and estimate the Kolmogorov-Smirnov (KS) statistic. The KS statistic is defined as KS = sup_{x∈X} |F(x) - G(x)|, where F and G are the cumulative distribution functions associated with the probability densities p(x) and q(x), respectively. By definition, KS ∈ [0, 1] and KS = 0 if and only if p(x) = q(x) for almost every x ∈ X. However, equation (6) is only valid if the conditions of Theorem 1 are fulfilled. There is no reason why using equation (7) for the full Gaussian noise would lead to a density on the manifold X; the KS statistic is ill-posed in this case. Nevertheless, we are interested in measuring the sensitivity to the noise, and thus consider the KS statistic as a relative performance measure. In Figure 4, we display the KS values depending on different levels of noise for the NG (blue) and FG (orange) noise, compared with the ground-truth von Mises distribution. Also, we embed the circle into higher dimensions D = 5, 10, 15, 20 and repeat this experiment. The results for D = 2 and D = 20 are shown in the first row (left and right).
For D = 2, we add the FOM model (which is independent of σ²) as a horizontal line. Besides, we depict the lower and upper bounds for σ² from Section 3.3 with dashed vertical lines. In the lower-left image, we show the optimal KS values obtained for both models depending on D. The lower-right image shows the corresponding σ² for those optimal KS values. In bright, the optimal average σ² is shown, whereas the dark regions mark the minimum and maximum values of σ² such that we outperformed the FOM benchmark. We note that in both cases the averaged optimal σ² lies within the predicted bounds. Several aspects are remarkable. The flat course of the KS-vs-σ² plot indicates that the method is not very sensitive to the noise, and this does not change with the dimensionality of the embedding space. Also, the optimal KS values do not change much with D, and the NG and FG models approach each other, as predicted. Interestingly, the onset of the increase in the KS value for the NG noise is roughly 3, which is the radius of the circle. For increasing σ², X̃ resembles more and more a double cone, which is not diffeomorphic to R², and thus the NF used to train the inflated distribution may not be able to capture the density close to the circle's center correctly.
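The two-sample version of the KS statistic used above can be computed from samples as the sup-distance between empirical CDFs (a minimal sketch, not the evaluation pipeline of the experiments):

```python
import numpy as np

# Two-sample KS statistic: KS = sup_x |F(x) - G(x)| for the
# empirical CDFs F, G of two sample sets.
def ks_statistic(a, b):
    grid = np.sort(np.concatenate([a, b]))
    F = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    G = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.abs(F - G).max()

rng = np.random.default_rng(3)
# Same distribution: KS should be small. Shifted: KS should be large.
same = ks_statistic(rng.normal(0, 1, 5000), rng.normal(0, 1, 5000))
diff = ks_statistic(rng.normal(0, 1, 5000), rng.normal(2, 1, 5000))
assert same < 0.1 and diff > 0.5
```

For two Gaussians shifted by 2 with unit variance, the population KS value is 2Φ(1) - 1 ≈ 0.68, which the empirical statistic approaches with growing sample size.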

4.2. MIXTURES ON S²

We show that we can learn a complicated distribution, a mixture of 2-dimensional von Mises distributions on a sphere with radius 1, without using any knowledge about the manifold except its intrinsic dimension. For certain magnitudes of σ², we obtain estimates similar to the FOM benchmark, as can be seen in the direct comparison of the learned densities, see Figure 5 (top right), and the KS statistics (bottom right). As for the circle, the Gaussian restricted to the normal space allows for a greater range of noise magnitudes without sacrificing the quality of the estimate.

5. DISCUSSION

To overcome the limitation of NFs in learning a density p*(x) defined on a low-dimensional manifold, we proposed to inflate the manifold in the ambient space such that it becomes diffeomorphic to R^D, learn the inflated density using an NF, and, finally, deflate the learned density according to Theorem 1. There, we provided sufficient conditions on the choice of inflation such that we can compute p*(x) exactly. Notably, we do not need to assume that p*(x) is supported on a flat manifold. Our method depends on some critical points which we addressed in Section 3.3. So far, the magnitude of noise σ² when using NFs on real-world data is chosen somewhat arbitrarily. As a first step to overcome this arbitrariness, we derived an upper bound for σ² in Proposition 1 and established an interesting connection to the manifold learning literature in Theorem 2. We hope that our theoretical results motivate new research directions. Using full Gaussian noise to learn the inflated distribution smears out information on p*(x), in particular if p*(x) has many local extrema. This loss of information may be especially impactful in out-of-distribution (OOD) detection or when it comes to adversarial robustness. Therefore, developing methods which allow generating noise in the manifold's normal space could improve the performance of NFs on such tasks. Another interesting direction is to exploit the product form of equation (6) and learn low-dimensional representations by forcing the NF to be noise-insensitive in the first d components and noise-sensitive in the remaining ones. Inverting the corresponding flow allows sampling directly from the manifold, which has the potential to improve the generative ability of NFs.

A.1 PROOF OF THEOREM 1

Now let x̃ ∈ N_x for an x ∈ X. Since X is Q-normally separated, P_X̃-almost all x̃ are uniquely determined by (x, ε) such that x̃ = x + ε.
Therefore, we have for P_X̃-almost all x̃ = x + ε that

P_X̃((x̃, x̃ + dx̃) ∩ X̃) = P_(X,ε_n)({(x, ε) ∈ R^D × R^D | x + ε ∈ (x̃, x̃ + dx̃) ∩ X̃})
= P(X + ε_n ∈ (x̃, x̃ + dx̃) ∩ X̃)
= P(X ∈ (x, x + dx) ∩ X) · P(x + ε_n ∈ (x̃, x̃ + dx̃) ∩ N_x)
= P_X((x, x + dx) ∩ X) · P_{X̃|X=x}((x̃, x̃ + dx̃) ∩ N_x) (12)

where for the first equality we used equation (11) and for the third the fact that x and ε are almost surely uniquely determined by x̃. Both probability measures on the right-hand side have a density. For P_X with respect to dV_f, see Section 2, this density is p*(x). Similarly, since N_x is a linear subspace of R^D, q_n(x̃|x) is the density of P_{X̃|X=x} with respect to the volume form dV_h where h is the mapping from R^{D-d} to N_x, see Remark 1. Then, the corresponding density of P_X̃ is with respect to the product measure dV_h · dV_f. However, this product measure is equivalent to λ_D when restricted to subsets of X̃, and thus q_n is the density of P_X̃ with respect to λ_D. Therefore, we can write equation (12) in terms of densities as follows:

q_n(x̃) = p*(x) q_n(x̃|x) (13)

and it holds that

∫_X̃ q_n(x̃) dx̃ = ∫_X ∫_{N_x} p*(x) q_n(x̃|x) dV_h(x̃) dV_f(x) = ∫_X p*(x) dV_f(x) = 1,

as needed for a density on X̃. By setting x̃ to x in equation (13), we obtain equation (7). This ends the proof.

Remark 2 Note that in Theorem 1, we need that X̃ is diffeomorphic to R^D. This requires that the noise distribution q_n(·|x) is continuous for all x.

A.2 PROOF OF PROPOSITION 1

The generating function f is an embedding for X and X = f(Z) has the density p*(x) for x ∈ X. We may extend the domain of p*(x) to all points x ∈ R^D using the Dirac delta function as follows:

p*(x) = ∫_Z δ(x - f(z)) π(z) dz.

After inflating X, we have that

p_Σ(x̃) = ∫_Z N(x̃; f(z), Σ) π(z) dz

with covariance matrix Σ ∈ R^{D×D}. Assume x̃ = x for some x ∈ X. We Taylor expand f around z₀ = f^{-1}(x) up to first order,

f(z) ≈ f(z₀) + J_f(z₀)(z - z₀),

and π(z) up to second order,

π(z) ≈ π(z₀) + π'(z₀)^T (z - z₀) + (1/2)(z - z₀)^T π''(z₀)(z - z₀),

where π'(z₀) denotes the gradient and π''(z₀) the Hessian of π evaluated at z₀, thus π'(z₀) ∈ R^d and π''(z₀) ∈ R^{d×d}. Then, we can approximate p_Σ(x) as follows:

p_Σ(x) ≈ 1/√((2π)^D det(Σ)) ∫_Z exp( -(1/2)(z - z₀)^T J_f^T Σ^{-1} J_f (z - z₀) ) · ( π(z₀) + π'(z₀)^T (z - z₀) + (1/2)(z - z₀)^T π''(z₀)(z - z₀) ) dz.

Now define Σ̃^{-1} = J_f^T Σ^{-1} J_f. Then,

p_Σ(x) ≈ √(det(Σ̃)/((2π)^{D-d} det(Σ))) ∫_Z 1/√((2π)^d det(Σ̃)) exp( -(1/2)(z - z₀)^T Σ̃^{-1} (z - z₀) ) · ( π(z₀) + π'(z₀)^T (z - z₀) + (1/2)(z - z₀)^T π''(z₀)(z - z₀) ) dz.

Thus, we can exploit this Gaussian in Z-space and get

p_Σ(x) ≈ √(det(Σ̃)/((2π)^{D-d} det(Σ))) ( π(z₀) + (1/2) E[(z - z₀)^T π''(z₀)(z - z₀)] ) = √(det(Σ̃)/((2π)^{D-d} det(Σ))) ( π(z₀) + (1/2) ||π''(z₀) ∘ Σ̃||_+ ),

where ∘ stands for elementwise multiplication and ||A||_+ = Σ_{i,j=1}^d A_ij for a matrix A ∈ R^{d×d}. For the special case where Σ = σ² I_D, we can simplify this expression by exploiting that

√(det(Σ̃)/((2π)^{D-d} det(Σ))) = (2π)^{-(D-d)/2} σ^d σ^{-D} g_f^{-1/2} = 1/((2πσ²)^{(D-d)/2} √(g_f)). (23)

Thus, in total, we get for this special choice of Σ

p_σ(x) ≈ 1/((2πσ²)^{(D-d)/2} √(g_f)) ( π(z₀) + (σ²/2) ||π''(z₀) ∘ (J_f^T J_f)^{-1}||_+ ) = 1/((2πσ²)^{(D-d)/2} √(g_f)) π(z₀) ( 1 + σ²/(2π(z₀)) ||π''(z₀) ∘ (J_f^T J_f)^{-1}||_+ ).

We now assume σ²/(2π(z₀)) ||π''(z₀) ∘ (J_f^T J_f)^{-1}||_+ ≪ 1.
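The step from the Gaussian integral to the elementwise norm uses two standard facts about a Gaussian with z - z₀ ∼ N(0, Σ̃): the linear term integrates to zero, and the quadratic term reduces to a trace,

```latex
\mathbb{E}\left[\pi'(z_0)^T (z - z_0)\right] = 0,
\qquad
\mathbb{E}\left[(z - z_0)^T \pi''(z_0)\,(z - z_0)\right]
 = \operatorname{tr}\!\left(\pi''(z_0)\,\tilde{\Sigma}\right)
 = \sum_{i,j=1}^{d} \pi''(z_0)_{ij}\,\tilde{\Sigma}_{ij}
 = \left\lVert \pi''(z_0) \circ \tilde{\Sigma} \right\rVert_{+},
```

where the last equality holds because both π''(z₀) and Σ̃ are symmetric.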
Note that 1/(2πσ²)^{(D-d)/2} from equation (23) is exactly the normalization constant obtained when inflating the manifold with Gaussian noise in the normal space, q_n(x|x) = 1/(2πσ²)^{(D-d)/2}. It follows that lim_{σ²→0} p_σ(x)/q_n(x|x) = p*(x), as we wanted to show.

A.3 PROOF OF THEOREM 2

The theorem follows directly from the definition of the reach number τ_X of X. It is defined as the supremum of all r ≥ 0 such that the orthogonal projection pr_X onto X is well-defined on the r-neighbourhood X_r of X, X_r := {x̃ ∈ R^D | dist(x̃, X) ≤ r}, where dist(x̃, X) denotes the distance of x̃ to X. Thus,

τ_X = sup{ r ≥ 0 | ∀x̃ ∈ R^D, dist(x̃, X) ≤ r ⟹ ∃! x ∈ X s.t. dist(x̃, X) = ||x̃ - x|| }, (26)

see Definition 2.1 in Berenfeld & Hoffmann (2019). By assumption, τ_X > 0. Thus, for all x̃ ∈ X_{τ_X}, the point x := pr_X(x̃) is unique. Since X is a manifold, it must hold that x̃ ∈ N_x where N_x denotes the normal space at x. Let the noise-generating distributions be uniform distributions on the ball with radius τ_X, thus q_n(x̃|x) = Uniform(x̃; B(x, τ_X) ∩ N_x), where B(x, τ_X) denotes a D-dimensional ball with radius τ_X and center x. Then, we have for X̃ = ∪_{x∈X} N_{q_n(·|x)} that

X̃ = X_{τ_X}. (27)

Thus, X is Q-normally separated where Q := {q_n(·|x)}_{x∈X}.
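The reach argument can be sketched for the circle of radius r, whose reach is τ_X = r: every point strictly inside the r-neighbourhood, except the centre (a null set), projects uniquely back onto its generator (a minimal illustration, not part of the proof):

```python
import numpy as np

# Circle of radius r: points x_tilde = x + gamma * x/r with |gamma| < r
# lie within the reach, so the orthogonal projection
# pr(x_tilde) = r * x_tilde / |x_tilde| recovers the generator x.
rng = np.random.default_rng(4)
r = 3.0
z = rng.uniform(0, 2 * np.pi, 10_000)
x = r * np.stack([np.cos(z), np.sin(z)], axis=1)     # generators on X
gamma = rng.uniform(-0.99 * r, 0.99 * r, size=(len(z), 1))
x_tilde = x + gamma * x / r                          # radial (normal-space) noise

norms = np.linalg.norm(x_tilde, axis=1, keepdims=True)
proj = r * x_tilde / norms                           # projection onto X
assert np.allclose(proj, x, atol=1e-9)
```

Choosing |gamma| up to 1.5·r instead would push samples past the centre, where the projection flips to the antipodal point, mirroring the failure case in Figure 2 (right).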

B EXPERIMENTS

B.1 TECHNICAL DETAILS FOR CIRCLE EXPERIMENTS

For the Normalizing Flow T^{-1}(x) we use a BNAF (Block Neural Autoregressive Flow) for the circle experiments. The number of hidden dimensions was adapted to the dimensionality of the data and the difficulty of the target density; these details are reported in the corresponding table. For the optimization scheme, we used the Adam optimizer with an initial learning rate of 0.1 and a learning rate decay of 0.5 after 2000 optimization steps without improvement (learning rate patience). The batch size was set to 200. The total number of iterations (one iteration corresponds to updating the parameters using one batch sample) is also reported in the table. No hyperparameter fine-tuning was done. For the FOM and χ²-noise models, we use the same architecture as for the D = 2 case.

For the Normalizing Flow T^{-1}(x) we use rational-quadratic neural spline flows, alternating coupling layers and random feature permutations, see Durkan et al. (2019). For the optimization scheme, we used the AdamW optimizer with an initial learning rate of 0.0003, a learning rate cosine decay to 0 after every 2000 optimization steps, see Loshchilov & Hutter (2016), a weight decay of 0.0001, and a dropout probability of 0. The batch size was set to 200. No hyperparameter fine-tuning was done. See Table 2 for more details. We use the same architecture for the FOM model and 3 seeds for the error bars.

Table 2: Neural spline flow details.
Coupling layers | residual blocks | hidden features | bins | spline range | total parameters | iterations
10 | 3 | 50 | 8 | 3 | 171,845 | 50000



Thus, a regular continuously differentiable mapping (called an immersion) which, restricted to its image, is a homeomorphism.

Technically, the circle does not fulfill the conditions of Proposition 1 since the domain of f is not R.

Note that our method still depends on how well an NF can learn the inflated distribution.

Note that the scaling factor depends on D: q_n(x̃|x) = 1/(2πσ²)^{(D-d)/2}. This is because the column vectors of J_f and J_h form a basis of R^D.
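The last footnote above can be made concrete: for Gaussian noise with variance σ² in the D - d normal directions, the noise density evaluated on the manifold itself is (2πσ²)^{-(D-d)/2}, so deflating the learned density amounts to multiplying by the reciprocal of this factor. A small sketch of this bookkeeping (our illustration, not the paper's code):

```python
import numpy as np

def deflation_factor(D, d, sigma2):
    """Reciprocal of the on-manifold Gaussian noise density
    (2*pi*sigma2)^(-(D-d)/2): multiplying the inflated density by
    this factor recovers the density on the d-dimensional manifold."""
    return (2.0 * np.pi * sigma2) ** ((D - d) / 2.0)

# Sanity checks: for a flat manifold (d = D) no deflation is needed,
# and for sigma2 = 1/(2*pi) the factor is exactly 1 in any codimension.
assert np.isclose(deflation_factor(D=5, d=5, sigma2=0.01), 1.0)
assert np.isclose(deflation_factor(D=2, d=1, sigma2=1.0 / (2.0 * np.pi)), 1.0)
```

Note that the factor depends only on the codimension D - d and σ², which is why the method needs the manifold dimension d but not the manifold itself when Gaussian noise is used.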



Figure 3: Learned densities for σ² = 0.01 (top) and σ² = 1 (bottom), respectively. First row: samples used for training the respective model: FG (left), NG (middle), FOM/χ² (right). The black line depicts the manifold X (a circle with radius 3) and the color codes the value of p*(x). Second row: colored line: learned density p(x) according to equation (7), multiplied by 3. Black line: ground-truth von Mises distribution.

Figure 4: KS values for the NG (blue) and FG noise method (orange) depending on σ² ∈ [10⁻⁹, 10] and the embedding dimension D = 5, 10, 15, 20, in log-scale. For D = 2 and D = 20 (top right), the two vertical lines represent the lower and upper bounds for σ² estimated according to Chapter 3.3 with 10K samples. The KS value obtained from FOM is plotted as a horizontal line. Bottom left: optimal KS values depending on D. Bottom right: optimal averaged σ² such that the optimal KS is obtained (bright), and the maximum and minimum σ² such that the FOM benchmark is outperformed (dark). The dashed horizontal lines are again the theoretical bounds. We used 10 seeds for the error bars and plot in log-scale.

Figure 5: Left: target density. Upper right: learned densities using FOM and our method with Gaussian noise and σ² = 0.01. Lower right: KS vs. σ² for the Gaussian noise model (full and in normal space) compared to FOM, with the theoretical bounds for σ² from Chapter 3.3 depicted as vertical dashed lines (10K samples were used to approximate these bounds).

BNAF details for circle experiments.

B.2 TECHNICAL DETAILS FOR MIXTURE OF VON MISES DISTRIBUTIONS ON S²

BNAF details for a mixture of von Mises on S².

A APPENDIX

A.1 PROOF OF THEOREM 1

We denote the probability measure of the random variable X as P_X; it is defined on (X, B(X)), where B(X) is the set of Borel sets of R^D intersected with X. For a realisation x of X, we denote the probability measure of the shifted random variable x + ε_n as P_{X̃|X=x}; it is defined on (N_x, B(N_x)). We extend both measures to (R^D, B(R^D)) by setting the probability to 0 whenever a set A ∈ B(R^D) has no intersection with X or N_x, respectively. For instance, this means for x̃ ∈ N_x that P_{X̃|X=x}((x̃, x̃ + dx̃)) = q_n(x̃|x) dx̃, where (x̃, x̃ + dx̃) denotes an infinitesimal volume element around x̃. The random variable X̃ = X + ε_n takes values in (X̃, B(X̃)) and has as its probability measure the pushforward of P_{(X,ε_n)} with respect to the mapping (x, ε_n) → x + ε_n, where P_{(X,ε_n)} is the joint measure of X and ε_n. Thus, for A ∈ B(X̃), we have that P_X̃(A) = P_{(X,ε_n)}({(x, ε_n) | x + ε_n ∈ A}).

The target distribution is a mixture of four von Mises distributions, p*_1(φ_1, θ_1), p*_2(φ_2, θ_2), p*_3(φ_3, θ_3) and p*_4(φ_4, θ_4). Each of these distributions has the same product form. We set κ = 6. However, they differ in their mean values μ_i and m_i; see Table 3. We used 3 different seeds in total to obtain the confidence intervals.

Table 3: Mean values for the mixture of von Mises distributions.
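For reference, a one-dimensional von Mises density with concentration κ = 6 can be evaluated with numpy alone (np.i0 is the modified Bessel function I_0), and the product form over (φ, θ) combines into an equal-weight mixture. A hedged sketch, with the mean values below chosen purely for illustration (the actual values are those of Table 3):

```python
import numpy as np

KAPPA = 6.0

def von_mises_pdf(x, mu, kappa=KAPPA):
    """Density of the von Mises distribution on [-pi, pi)."""
    return np.exp(kappa * np.cos(x - mu)) / (2.0 * np.pi * np.i0(kappa))

def product_pdf(phi, theta, mu, m, kappa=KAPPA):
    """Product form p_i*(phi, theta) = vM(phi; mu_i) * vM(theta; m_i)."""
    return von_mises_pdf(phi, mu, kappa) * von_mises_pdf(theta, m, kappa)

def mixture_pdf(phi, theta, mus, ms):
    """Equal-weight mixture of four product von Mises densities."""
    return np.mean([product_pdf(phi, theta, mu, m) for mu, m in zip(mus, ms)], axis=0)

# Hypothetical mean values (placeholders, not the ones from Table 3):
mus = [0.0, np.pi / 2, np.pi, -np.pi / 2]
ms = [0.0, 0.0, np.pi / 2, np.pi / 2]

# The mixture integrates to ~1 over [-pi, pi)^2 (midpoint rule on a grid):
h = 2 * np.pi / 200
g = np.linspace(-np.pi, np.pi, 201)[:-1] + h / 2
P, T = np.meshgrid(g, g)
total = mixture_pdf(P, T, mus, ms).sum() * h ** 2
assert abs(total - 1.0) < 1e-3
```

This is only the angular density; the experiments evaluate it on the sphere parameterized by (φ, θ) as described in the text.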

