REPRESENTATIONAL ASPECTS OF DEPTH AND CONDITIONING IN NORMALIZING FLOWS

Abstract

Normalizing flows are among the most popular paradigms in generative modeling, especially for images, primarily because we can efficiently evaluate the likelihood of a data point. This is desirable both for evaluating the fit of a model and for ease of training, as maximizing the likelihood can be done by gradient descent. However, training normalizing flows comes with difficulties as well: models which produce good samples typically need to be extremely deep, which comes with accompanying vanishing/exploding gradient problems. A closely related problem is that they are often poorly conditioned: since they are parametrized as invertible maps from $\mathbb{R}^d \to \mathbb{R}^d$, and typical training data like images is intuitively lower-dimensional, the learned maps often have Jacobians that are close to being singular. In our paper, we tackle representational aspects around depth and conditioning of normalizing flows, both for general invertible architectures and for a particular common architecture, affine couplings. For general invertible architectures, we prove that invertibility comes at a cost in terms of depth: we show examples where a much deeper normalizing flow model may need to be used to match the performance of a non-invertible generator. For affine couplings, we first show that the choice of partitions is not a likely bottleneck for depth: we show that any invertible linear map (and hence a permutation) can be simulated by a constant number of affine coupling layers, using a fixed partition. This shows that the extra flexibility conferred by 1x1 convolution layers, as in GLOW, can in principle be simulated by increasing the size by a constant factor. Next, in terms of conditioning, we show that affine couplings are universal approximators, provided the Jacobian of the model is allowed to be close to singular. We furthermore empirically explore the benefit of different kinds of padding, a common strategy for improving conditioning.

1. INTRODUCTION

Deep generative models are one of the lynchpins of unsupervised learning, underlying tasks spanning distribution learning, feature extraction and transfer learning. Parametric families of neural-network-based models have been improved to the point of being able to model complex distributions like images of human faces. One paradigm that has received a lot of attention is normalizing flows, which model distributions as pushforwards of a standard Gaussian (or other simple distribution) through an invertible neural network $G$. Thus, the likelihood has an explicit form via the change of variables formula using the Jacobian of $G$. Training normalizing flows is challenging due to two main issues. Empirically, these models seem to require a much larger size than other generative models (e.g. GANs) and, most notably, a much larger depth. This makes training challenging due to vanishing/exploding gradients. A closely related problem is conditioning, more precisely the smallest singular value of the forward map $G$. It is intuitively clear that natural images have a low-dimensional structure, thus a close-to-singular $G$ might be needed. On the other hand, the change-of-variables formula involves the determinant of the Jacobian of $G^{-1}$, which grows larger the more singular $G$ is. While the universal approximation power of various types of invertible architectures has recently been studied (Dupont et al., 2019; Huang et al., 2020) under the assumption that the input is padded with a sufficiently large number of all-0 coordinates, precise quantification of the cost of invertibility in terms of the depth required and the conditioning of the model has not been fleshed out. In this paper, we study, both mathematically and empirically, representational aspects of depth and conditioning in normalizing flows and answer several fundamental questions.
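
For concreteness, the change-of-variables formula referred to above can be written out as follows. This is the standard identity, stated with symbols that are our own naming assumptions ($p_Z$ for the Gaussian density, $p_X$ for the model density, $J_G$ for the Jacobian of $G$):

```latex
% x = G(z) with z ~ N(0, I_d) and G invertible and differentiable
\log p_X(x)
  = \log p_Z\bigl(G^{-1}(x)\bigr) + \log\bigl|\det J_{G^{-1}}(x)\bigr|
  = \log p_Z(z) - \log\bigl|\det J_G(z)\bigr| ,
  \qquad z = G^{-1}(x).
```

In this form the tension described above is visible directly: as the smallest singular value of $J_G(z)$ approaches zero, $|\det J_G(z)|$ shrinks and the term $-\log|\det J_G(z)|$ grows without bound.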

2.1. RESULTS ABOUT GENERAL ARCHITECTURES

In order to guarantee that the network is invertible, normalizing flow models place significant restrictions on the architecture of the model. The most basic question we can ask is how this restriction affects the expressive power of the model, and in particular how much the depth must increase to compensate. More precisely, we ask:

Question 1: Is there a distribution over $\mathbb{R}^d$ which can be written as the pushforward of a Gaussian through a small, shallow generator, but which cannot be approximated by the pushforward of a Gaussian through a small, shallow layerwise invertible neural network?

Given that there is great latitude in the choice of layer architecture while keeping the network invertible, the most general way to pose this question is to require each layer to be a function of $p$ parameters, i.e. $f = f_1 \circ f_2 \circ \cdots \circ f_\ell$, where $\circ$ denotes function composition and each $f_i : \mathbb{R}^d \to \mathbb{R}^d$ is an invertible function specified by a vector $\theta_i \in \mathbb{R}^p$ of parameters. This framing is extremely general: for instance, it includes layerwise invertible feedforward networks in which $f_i(x) = \sigma^{\otimes d}(A_i x + b_i)$, $\sigma$ is invertible, $A_i \in \mathbb{R}^{d \times d}$ is invertible, $\theta_i = (A_i, b_i)$ and $p = d(d+1)$. It also includes popular architectures based on affine coupling blocks (e.g. Dinh et al. (2014; 2016); Kingma & Dhariwal (2018)), where each $f_i$ has the form $f_i(x_{S_i}, x_{[d] \setminus S_i}) = (x_{S_i},\ x_{[d] \setminus S_i} \odot g_i(x_{S_i}) + h_i(x_{S_i}))$ for some $S_i \subset [d]$, which we revisit in more detail in the following subsection.

We answer this question in the affirmative: namely, we show for any $k$ that there is a distribution over $\mathbb{R}^d$ which can be expressed as the pushforward of a network with depth $O(1)$ and size $O(k)$ that cannot be (even very approximately) expressed as the pushforward of a Gaussian through a Lipschitz layerwise invertible network of depth smaller than $k/p$.

Towards formally stating the result, let $\theta = (\theta_1, \ldots, \theta_\ell) \in \Theta \subset \mathbb{R}^{d'}$ be the vector of all parameters (e.g. weights, biases) in the network, where $\theta_i \in \mathbb{R}^p$ are the parameters that correspond to layer $i$, and let $f_\theta : \mathbb{R}^d \to \mathbb{R}^d$ denote the resulting function. Define $R$ so that $\Theta$ is contained in the Euclidean ball of radius $R$. We say the family $f_\theta$ is $L$-Lipschitz with respect to its parameters and inputs if for all $\theta, \theta' \in \Theta$, $\mathbb{E}_{x \sim N(0, I_{d \times d})} \|f_\theta(x) - f_{\theta'}(x)\| \le L \|\theta - \theta'\|$, and for all $x, y \in \mathbb{R}^d$, $\|f_\theta(x) - f_\theta(y)\| \le L \|x - y\|$. We will discuss the reasonable range for $L$ in terms of the weights after the theorem statement. We show:

Theorem 1. For any $k = \exp(o(d))$, $L = \exp(o(d))$, $R = \exp(o(d))$, we have that for $d$ sufficiently large and any $\gamma > 0$ there exists a neural network $g : \mathbb{R}^{d+1} \to \mathbb{R}^d$ with $O(k)$ parameters and depth $O(1)$, such that for any family $\{f_\theta, \theta \in \Theta\}$ of layerwise invertible networks that are $L$-Lipschitz with respect to their parameters and inputs, have $p$ parameters per layer and depth at most $k/p$, we have for all $\theta \in \Theta$, $W_1((f_\theta)_{\#\mathcal{N}}, g_{\#\mathcal{N}}) \ge 10\gamma^2 d$. Furthermore, for all $\theta \in \Theta$, $\mathrm{KL}((f_\theta)_{\#\mathcal{N}}, g_{\#\mathcal{N}}) \ge 1/10$ and $\mathrm{KL}(g_{\#\mathcal{N}}, (f_\theta)_{\#\mathcal{N}}) \ge \frac{10\gamma^2 d}{L^2}$.

Remark 1: First, note that while the number of parameters in both networks is comparable (i.e. it is $O(k)$), the invertible network is deeper, which is usually accompanied by algorithmic difficulties for training, due to vanishing and exploding gradients. For layerwise invertible generators, if we assume that the nonlinearity $\sigma$ is 1-Lipschitz and each matrix in the network has operator norm at most $M$, then a depth-$\ell$ network will have $L = O(M^\ell)$ and $p = O(d^2)$. For an affine coupling network with $g, h$ parameterized by $H$-layer networks with $p/2$ parameters each, 1-Lipschitz activations and weights bounded by $M$ as above, we would similarly have $L = O(M^{\ell H})$.

Remark 2: We make a couple of comments on the "hard" distribution $g$ we construct, as well as the meaning of the parameter $\gamma$ and how to interpret the various lower bounds in the different metrics. The distribution $g$ for a given $\gamma$ will in fact be close to a mixture of $k$ Gaussians, each with mean on the sphere of radius $10\gamma^2 d$ and covariance matrix $\gamma^2 I_d$. Thus this distribution has most of its mass in a sphere of radius $O(\gamma^2 d)$, so the Wasserstein guarantee gives close to a trivial approximation for $g$. The KL divergence bounds are derived by so-called transport inequalities between KL and Wasserstein for subgaussian distributions (Bobkov & Götze, 1999). The discrepancy between the two KL divergences comes from the fact that the functions $g, f_\theta$ may have different Lipschitz constants, hence the tails of $g_{\#\mathcal{N}}$ and $(f_\theta)_{\#\mathcal{N}}$ behave differently. In fact, if the function $f_\theta$ had the same Lipschitz constant as $g$, both KL lower bounds would be on the order of a constant.

2.2. RESULTS ABOUT AFFINE COUPLING ARCHITECTURES

Next, we prove several results for a particularly common family of normalizing flow architectures: those based on affine coupling layers (Dinh et al., 2014; 2016; Kingma & Dhariwal, 2018). The appeal of these architectures comes from training efficiency. Although layerwise invertible neural networks (i.e. networks for which each layer consists of an invertible matrix and an invertible pointwise nonlinearity) seem like a natural choice, in practice these models have several disadvantages: for example, computing the determinant of the Jacobian is expensive unless the weight matrices are restricted. Consequently, it is typical for the transformations in a flow network to be constrained in a manner that allows for efficient computation of the Jacobian determinant. The most common building block is an affine coupling block, originally proposed by Dinh et al. (2014; 2016): a map $f : \mathbb{R}^d \to \mathbb{R}^d$ such that $f(x_S, x_{[d] \setminus S}) = (x_S,\ x_{[d] \setminus S} \odot s(x_S) + t(x_S))$, where $\odot$ denotes entrywise multiplication. The modeling power will be severely constrained if the coordinates in $S$ never change, so flow models typically change the set $S$ in a fixed or learned way (e.g. alternating between different partitions of the channels in Dinh et al. (2016), or applying a learned permutation in Kingma & Dhariwal (2018)). Since a permutation is a discrete object, it is difficult to learn in a differentiable manner, so Kingma & Dhariwal (2018) simply learn an invertible linear function (i.e. a 1x1 convolution) as a differentiation-friendly relaxation thereof.
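
To make the affine coupling block concrete, the following is a minimal NumPy sketch of the forward map, its inverse, and the log-determinant of its Jacobian, taking $S$ to be the first `d_s` coordinates. The functions `s_fn` and `t_fn` are hypothetical toy choices made only for illustration; in practice $s$ and $t$ are neural networks.

```python
import numpy as np

def coupling_forward(x, s_fn, t_fn, d_s):
    """f(x_S, x_rest) = (x_S, x_rest * s(x_S) + t(x_S)), with S = first d_s coordinates."""
    x_s, x_rest = x[:d_s], x[d_s:]
    scale = s_fn(x_s)                       # must be nonzero for invertibility
    y_rest = x_rest * scale + t_fn(x_s)
    # The Jacobian is block lower triangular with diagonal blocks I and diag(scale),
    # so its log-determinant is a sum of d - d_s terms.
    log_det = np.sum(np.log(np.abs(scale)))
    return np.concatenate([x_s, y_rest]), log_det

def coupling_inverse(y, s_fn, t_fn, d_s):
    """Invert the block: x_S is passed through unchanged, so s and t can be recomputed."""
    y_s, y_rest = y[:d_s], y[d_s:]
    x_rest = (y_rest - t_fn(y_s)) / s_fn(y_s)
    return np.concatenate([y_s, x_rest])

# Hypothetical toy scale and shift maps (assumptions, not from the paper).
def s_fn(u):
    return np.exp(np.tanh(u))               # strictly positive scale

def t_fn(u):
    return u ** 2

x = np.array([0.5, -1.0, 2.0, 0.3])
y, log_det = coupling_forward(x, s_fn, t_fn, d_s=2)
x_rec = coupling_inverse(y, s_fn, t_fn, d_s=2)
```

Neither the inverse nor the log-determinant requires inverting or differentiating `s_fn` and `t_fn`, which is the source of the training efficiency mentioned above.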

2.2.1. THE EFFECT OF CHOICE OF PARTITION ON DEPTH

The first question about affine couplings we ask is how much of a saving in terms of the depth of the network one can hope to gain from using learned partitions (à la Glow) as compared to a fixed partition. More precisely:

Question 2: Can models like Glow (Kingma & Dhariwal, 2018) be simulated by a sequence of affine blocks with a fixed partition without increasing the depth by much?

We answer this question in the affirmative, at least for equally sized partitions (which is what is typically used in practice). We show the following surprising fact: consider an arbitrary partition $(S, [2d] \setminus S)$ of $[2d]$ such that $|S| = d$, for $d \in \mathbb{N}$. Then for any invertible matrix $T \in \mathbb{R}^{2d \times 2d}$, the linear map $T : \mathbb{R}^{2d} \to \mathbb{R}^{2d}$ can be exactly represented by a composition of $O(1)$ affine coupling layers that are linear, namely have the form $L_i(x_S, x_{[2d] \setminus S}) = (x_S,\ B_i x_{[2d] \setminus S} + A_i x_S)$ or $L_i(x_S, x_{[2d] \setminus S}) = (C_i x_S + D_i x_{[2d] \setminus S},\ x_{[2d] \setminus S})$ for matrices $A_i, B_i, C_i, D_i \in \mathbb{R}^{d \times d}$ such that each $B_i, C_i$ is diagonal. For convenience of notation, without loss of generality let $S = [d]$. Then each of the layers $L_i$ is a matrix of the form $\begin{pmatrix} I & 0 \\ A_i & B_i \end{pmatrix}$ or $\begin{pmatrix} C_i & D_i \\ 0 & I \end{pmatrix}$, where the rows and columns are partitioned into blocks of size $d$. With this notation in place, we show the following theorem:

Theorem 2. For all $d \ge 4$, there exists a $k \le 24$ such that for any invertible $T \in \mathbb{R}^{2d \times 2d}$ with $\det(T) > 0$, there exist matrices $A_i, D_i \in \mathbb{R}^{d \times d}$ and diagonal matrices $B_i, C_i \in \mathbb{R}^{d \times d}_{\ge 0}$ for all $i \in [k]$ such that
$$T = \prod_{i=1}^{k} \begin{pmatrix} I & 0 \\ A_i & B_i \end{pmatrix} \begin{pmatrix} C_i & D_i \\ 0 & I \end{pmatrix}.$$

Note that the condition $\det(T) > 0$ is required, since affine coupling networks are always orientation-preserving. Adding one diagonal layer with negative signs suffices to model general matrices. In particular, since permutation matrices are invertible, this means that any application of permutations to achieve a different partition of the inputs (e.g. as in Glow (Kingma & Dhariwal, 2018)) can in principle be represented as a composition of not-too-many affine coupling layers, indicating that the flexibility in the choice of partition is not the representational bottleneck.

It is reasonable to ask how tight the $k \le 24$ bound is. We supplement our upper bound with a lower bound, namely that $k \ge 3$. This is surprising, as naive parameter counting would suggest $k = 2$ might work. Namely, we show:

Theorem 3. For all $d \ge 4$ and $k \le 2$, there exists an invertible $T \in \mathbb{R}^{2d \times 2d}$ with $\det(T) > 0$ such that for all $A_i, D_i \in \mathbb{R}^{d \times d}$ and for all diagonal matrices $B_i, C_i \in \mathbb{R}^{d \times d}_{\ge 0}$, $i \in [k]$, it holds that
$$T \ne \prod_{i=1}^{k} \begin{pmatrix} I & 0 \\ A_i & B_i \end{pmatrix} \begin{pmatrix} C_i & D_i \\ 0 & I \end{pmatrix}.$$

Beyond the relevance of this result in the context of how important the choice of partitions is, it also shows a lower bound on the depth for an equal number of nonlinear affine coupling layers (even with quite complex functions $s$ and $t$ in each layer), since a nonlinear network can always be linearized about a (smooth) point to give a linear network with the same number of layers. In other words, studying linear affine coupling networks lets us prove a depth lower bound/depth separation for nonlinear networks for free. Finally, in Section 5.3, we include an empirical investigation of our theoretical results on synthetic data, by fitting random linear functions of varying dimensionality with linear affine networks of varying depths in order to see the required number of layers. The results there suggest that the constant in the upper bound is quite loose and that the correct value for $k$ is likely closer to the lower bound, at least for random matrices.
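
The linearization argument can be illustrated numerically. The sketch below builds a small nonlinear coupling layer and checks by finite differences that its Jacobian at a point has the block form of a linear coupling layer: identity in the top-left block, zero in the top-right block, and a positive diagonal in the bottom-right block. The weight matrix `W` and the specific choices of scale and shift are hypothetical, picked only for illustration.

```python
import numpy as np

d = 3
# Hypothetical fixed weights used inside the scale and shift maps (an assumption).
W = np.array([[1.0, -0.5, 0.2],
              [0.3, 0.8, -1.0],
              [-0.7, 0.1, 0.6]])

def coupling(x):
    """Nonlinear affine coupling on R^{2d} with S = first d coordinates."""
    x_s, x_rest = x[:d], x[d:]
    s = np.exp(np.sin(W @ x_s))             # positive scale depending on x_S
    t = np.tanh(W.T @ x_s)                  # shift depending on x_S
    return np.concatenate([x_s, x_rest * s + t])

def numerical_jacobian(f, x, h=1e-6):
    """Central finite-difference Jacobian of f at x."""
    n = x.size
    J = np.zeros((n, n))
    for j in range(n):
        e = np.zeros(n)
        e[j] = h
        J[:, j] = (f(x + e) - f(x - e)) / (2 * h)
    return J

x0 = np.array([0.2, -0.7, 1.1, 0.4, -1.3, 0.9])
J = numerical_jacobian(coupling, x0)
top_left, top_right = J[:d, :d], J[:d, d:]
bottom_left, bottom_right = J[d:, :d], J[d:, d:]
```

Here `bottom_left` plays the role of the unconstrained block $A_i$ and `bottom_right` the role of the diagonal block $B_i$, so composing the Jacobians of the layers of a nonlinear network yields a product of linear coupling matrices with the same number of factors.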

2.2.2. UNIVERSAL APPROXIMATION WITH ILL-CONDITIONED AFFINE COUPLING NETWORKS

Finally, we turn to universal approximation and its close ties to conditioning. A recent work (Theorem 1 of Huang et al. (2020)) showed that deep affine coupling networks are universal approximators if we allow the training data to be padded with sufficiently many zeros. While zero padding is convenient for their analysis (in fact, similar proofs have appeared for other invertible architectures like Augmented Neural ODEs (Zhang et al.)), in practice models trained on zero-padded data often perform poorly (see Appendix C). In fact, we show that neither padding nor depth is necessary representationally: shallow models without zero padding are already universal approximators in Wasserstein distance.

Theorem 4 (Universal approximation without padding). Suppose that $P$ is the standard Gaussian measure in $\mathbb{R}^n$ with $n$ even and $Q$ is a distribution on $\mathbb{R}^n$ with bounded support that is absolutely continuous with respect to the Lebesgue measure. Then for any $\epsilon > 0$, there exists a depth-3 affine coupling network $g$, with maps $s, t$ represented by feedforward ReLU networks, such that $W_2(g_\# P, Q) \le \epsilon$.

Remark 1: A shared caveat of the universality construction in Theorem 4 with the construction in Huang et al. (2020) is that the resulting network is poorly conditioned. In the case of the construction in Huang et al. (2020), this is obvious because they pad the $d$-dimensional training data with $d$ additional zeros, and a network that takes as input a Gaussian distribution in $\mathbb{R}^{2d}$ (i.e. has full support) and outputs data on a $d$-dimensional manifold (the space of zero-padded data) must have a singular Jacobian almost everywhere. In the case of Theorem 4, the condition number of the network blows up at least as quickly as $1/\epsilon$ as we take the approximation error $\epsilon \to 0$, so this network is also ill-conditioned if we are aiming for a very accurate approximation.
Remark 2: Based on Theorem 3, the condition number blowup of either the Jacobian or the Hessian is necessary for a shallow model to be universal, even when approximating well-conditioned linear maps (see Remark 7). The network constructed in Theorem 4 is also consistent with the lower bound from Theorem 1, because the network we construct in Theorem 4 is highly non-Lipschitz and uses many parameters per layer.

3. RELATED WORK

On the empirical side, flow models were first popularized by Dinh et al. (2014), who introduced the NICE model and the idea of parametrizing a distribution as a sequence of transformations with triangular Jacobians, so that maximum likelihood training is tractable. Quickly thereafter, Dinh et al. (2016) improved the affine coupling block architecture they had introduced to allow non-volume-preserving (NVP) transformations, Papamakarios et al. (2017) introduced an autoregressive version, and finally Kingma & Dhariwal (2018) introduced 1x1 convolutions in the architecture, which they view as relaxations of permutation matrices, intuitively allowing learned partitions for the affine blocks. Subsequently, there have been variants on these ideas: Grathwohl et al. (2018), Dupont et al. (2019) and Behrmann et al. (2018) viewed these models as discretizations of ODEs and introduced ways to approximate determinants of non-triangular Jacobians, though these models still do not scale beyond datasets the size of CIFAR10. The conditioning/invertibility of trained models was experimentally studied by Behrmann et al. (2019), along with some "adversarial vulnerabilities" of the conditioning. Mathematically understanding the relative representational power of different types of generative models, and the statistical/algorithmic implications thereof, is however still a very poorly understood and nascent area of study. Most closely related to our results are the recent works of Huang et al. (2020) and Zhang et al. Both prove universal approximation results for invertible architectures (the former affine couplings, the latter neural ODEs) if the input is allowed to be padded with zeroes. As already expounded upon in the previous sections, our results prove universal approximation even without padding, but we focus on more fine-grained implications for the depth and conditioning of the learned model.
Another work (Kong & Chaudhuri, 2020) studies the representational power of Sylvester and Householder flows, normalizing flow architectures which are quite different from affine coupling networks. In particular, they prove a depth lower bound for local planar flows with bounded weights; for planar flows, our general Theorem 1 can also be applied, but the resulting lower bound instances are very different (ours targets multimodality, theirs targets tail behavior). More generally, there are various classical results that show a particular family of generative models can closely approximate most sufficiently regular distributions over some domain. Some examples are standard results for mixture models with very mild conditions on the component distribution (e.g. Gaussians, see (Everitt, 2014) ); Restricted Boltzmann Machines and Deep Belief Networks (Montúfar et al., 2011; Montufar & Ay, 2011) ; GANs (Bailey & Telgarsky, 2018) .

4. PROOF SKETCH OF THEOREM 1: DEPTH LOWER BOUNDS ON INVERTIBLE MODELS

In this section we sketch the proof of Theorem 1. The intuition behind the $k/p$ bound on the depth relies on parameter counting: a depth-$k/p$ invertible network will have $k$ parameters in total ($p$ per layer), which is the size of the network we are trying to represent. Of course, the difficulty is that we need more than $f_\theta$ and $g$ simply not being identical: we need a quantitative bound in various probability metrics. The proof will proceed as follows. First, we will exhibit a large family of distributions (of size $\exp(kd)$) such that each pair of these distributions has a large Wasserstein distance between them. Moreover, each distribution in this family will be approximately expressible as the pushforward of the Gaussian through a small neural network. Since the family of distributions will have a large pairwise Wasserstein distance, by the triangle inequality, no other distribution can be close to two distinct members of the family. Second, we can count the number of "approximately distinct" invertible networks of depth $\ell$: each layer is described by $p$ weights, hence there are $\ell p$ parameters in total. The Lipschitzness of the neural network in terms of its parameters then allows us to argue about discretizations of the weights. Formally, we show the following lemma:

Lemma 1 (Large family of well-separated distributions). For every $k = o(\exp(d))$, for $d$ sufficiently large and $\gamma > 0$, there exists a family $\mathcal{D}$ of distributions such that $|\mathcal{D}| \ge \exp(kd/20)$ and:
1. Each distribution $p \in \mathcal{D}$ is a mixture of $k$ Gaussians with means $\{\mu_i\}_{i=1}^k$, $\|\mu_i\|^2 = 20\gamma^2 d$, and covariance $\gamma^2 I_d$.
2. For all $p \in \mathcal{D}$ and all $\epsilon > 0$, we have $W_1(p, g_{\#\mathcal{N}}) \le \epsilon$ for a neural network $g$ with at most $O(k)$ parameters.
3. For any distinct $p, p' \in \mathcal{D}$, $W_1(p, p') \ge 20\gamma^2 d$.

The proof of this lemma relies on two ideas. First, we show that there is a family of distributions consisting of mixtures of Gaussians with $k$ components such that each pair of members of this family is far in $W_1$ distance, and each member of the family can be approximated by the pushforward of a network of size $O(k)$. The reason for choosing mixtures is that it is easy to lower bound the Wasserstein distance between two mixtures with equal weights and covariance matrices in terms of the distances between the means. We show this as Lemma 5 in Appendix A. Given this, to design a family of mixtures of Gaussians with large pairwise Wasserstein distance, it suffices to construct a large family of $k$-tuples for the means such that for each pair of $k$-tuples $(\{\mu_i\}_{i=1}^k, \{\nu_i\}_{i=1}^k)$, there exists a set $S \subseteq [k]$, $|S| \ge k/10$, such that for all $i \in S$, $\min_{1 \le j \le k} \|\mu_i - \nu_j\|^2 \ge 20\gamma^2 d$. We do this by leveraging ideas from coding theory (the Gilbert-Varshamov bound; Gilbert (1952); Varshamov (1957)). Namely, we first pick a set of $\exp(\Omega(d))$ vectors of norm $20\gamma^2 d$, each pair of which has a large distance; second, we pick a large number ($\exp(\Omega(kd))$) of $k$-tuples from this set at random, and show that with high probability no pair of tuples intersects in more than $k/10$ elements. This is subsumed by Lemmas 6 and 7 in Section A.

To handle part 2 of Lemma 1, we also show that a mixture of $k$ Gaussians can be approximated as the pushforward of a Gaussian through a network of size $O(k)$. The idea is rather simple: the network will use a sample from a standard Gaussian in $\mathbb{R}^{d+1}$. We will subsequently use the first coordinate to implement a "mask" that most of the time masks all but one randomly chosen coordinate in $[k]$. The remaining coordinates are used to produce a sample from each of the components in the mixture, and the mask is used to select only one of them. For details, see Section A.

With this lemma in hand, we finish the Wasserstein lower bound with a standard epsilon-net argument, using the parameter Lipschitzness of the invertible networks to show that the number of "different" invertible neural networks is on the order of $O((LR)^{d'})$, where $d'$ is the total number of parameters. This is Lemma 8 in Appendix A. The proof of Theorem 1 can then be finished by the triangle inequality: since the family of distributions has large pairwise Wasserstein distance, no other distribution can be close to two distinct members of the family. Finally, KL divergence bounds can be derived from the Bobkov-Götze inequality (Bobkov & Götze, 1999), which lower bounds KL divergence by the squared Wasserstein distance. The details are in Section A.
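
The masking idea for part 2 of Lemma 1 can be sketched in an idealized form. The code below is not the neural network from the proof: it uses an exact, hard selection via the Gaussian CDF of the first coordinate, whereas the paper approximates such a selection with a small network. The function name, the unit-norm means and all numerical values are assumptions made for illustration.

```python
import math
import numpy as np

def mixture_from_gaussian(z, means, gamma):
    """Map standard Gaussian samples z of shape (n, d+1) to samples of a k-component mixture.

    Coordinate 0 selects the component (the "mask"); the remaining d coordinates
    supply the within-component Gaussian noise.
    """
    k = means.shape[0]
    # The Gaussian CDF of the first coordinate is uniform on (0, 1).
    u = np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z[:, 0]])
    idx = np.minimum((u * k).astype(int), k - 1)    # each component has probability 1/k
    return means[idx] + gamma * z[:, 1:], idx

rng = np.random.default_rng(0)
k, d, gamma = 4, 5, 0.1
means = rng.standard_normal((k, d))
means /= np.linalg.norm(means, axis=1, keepdims=True)  # hypothetical means on the unit sphere
z = rng.standard_normal((2000, d + 1))
samples, idx = mixture_from_gaussian(z, means, gamma)
```

Because the selection coordinate is independent of the noise coordinates, each output is exactly a draw from the equal-weight mixture with covariance `gamma**2 * I`.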

5. PROOF SKETCH OF THEOREMS 2 AND 3: SIMULATING LINEAR FUNCTIONS WITH AFFINE COUPLINGS

In this section, we sketch the proofs of Theorems 2 and 3. Before proceeding to the proofs, we introduce a bit of helpful notation. We let $GL_+(2d, \mathbb{R})$ denote the group of $2d \times 2d$ matrices with positive determinant (see Artin (2011) for a reference on group theory). The lower triangular linear affine coupling layers are the subgroup $\mathcal{A}_L \subset GL_+(2d, \mathbb{R})$ of the form
$$\mathcal{A}_L = \left\{ \begin{pmatrix} I & 0 \\ A & B \end{pmatrix} : A \in \mathbb{R}^{d \times d},\ B \text{ is diagonal with positive entries} \right\},$$
and likewise the upper triangular linear affine coupling layers are the subgroup $\mathcal{A}_U \subset GL_+(2d, \mathbb{R})$ of the form
$$\mathcal{A}_U = \left\{ \begin{pmatrix} C & D \\ 0 & I \end{pmatrix} : D \in \mathbb{R}^{d \times d},\ C \text{ is diagonal with positive entries} \right\}.$$
Finally, define $\mathcal{A} = \mathcal{A}_L \cup \mathcal{A}_U \subset GL_+(2d, \mathbb{R})$. This set is not a subgroup because it is not closed under multiplication. Let $\mathcal{A}^k$ denote the $k$th power of $\mathcal{A}$, i.e. all elements of the form $a_1 \cdots a_k$ for $a_i \in \mathcal{A}$.
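
As a short worked example of why the union of the two subgroups is not closed under multiplication, consider the product of one lower and one upper triangular layer, written in the block notation above:

```latex
\begin{pmatrix} I & 0 \\ A & B \end{pmatrix}
\begin{pmatrix} C & D \\ 0 & I \end{pmatrix}
=
\begin{pmatrix} C & D \\ AC & AD + B \end{pmatrix},
\qquad
\det\begin{pmatrix} C & D \\ AC & AD + B \end{pmatrix} = \det(B)\,\det(C) > 0 .
```

Whenever $A \neq 0$ and $D \neq 0$, both off-diagonal blocks of the product are nonzero (note that $AC \neq 0$ because $C$ is invertible), so the product is neither lower nor upper block triangular and lies outside the union, although it still has positive determinant.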

5.1. UPPER BOUND

The main result of this section is the following:

Theorem 5 (Restatement of Theorem 2). There exists an absolute constant $1 < K \le 47$ such that for any $d \ge 1$, $GL_+(2d, \mathbb{R}) = \mathcal{A}^K$.

In other words, any linear map with positive determinant ("orientation-preserving") can be implemented using a bounded number of linear affine coupling layers. Note that there is a difference of a factor of two between the counting of layers in the statement of Theorem 2 and the counting of matrices in Theorem 5, because each layer is composed of two matrices. In group-theoretic language, this says that $\mathcal{A}$ generates $GL_+(2d, \mathbb{R})$ and furthermore the diameter of the corresponding (uncountably infinite) Cayley graph is upper bounded by a constant independent of $d$. The proof relies on the following two structural results. The first one is about representing permutation matrices, up to sign, using a constant number of linear affine coupling layers:

Lemma 2. For any permutation matrix $P \in \mathbb{R}^{2d \times 2d}$, there exists $\tilde{P} \in \mathcal{A}^{21}$ with $|\tilde{P}_{ij}| = |P_{ij}|$ for all $i, j$.

The second one shows how to represent matrices with special eigenvalue structure using a constant number of linear affine couplings:

Lemma 3. Let $M$ be an arbitrary invertible $d \times d$ matrix with distinct real eigenvalues and $S$ be a $d \times d$ lower triangular matrix with the same eigenvalues as $M^{-1}$. Then $\begin{pmatrix} M & 0 \\ 0 & S \end{pmatrix} \in \mathcal{A}^4$.

Given these lemmas, we briefly describe the strategy to prove Theorem 5. Every matrix has an LUP factorization (Horn & Johnson, 2012) into a lower-triangular, an upper-triangular, and a permutation matrix. Lemma 2 takes care of the permutation part, so what remains is building an arbitrary lower/upper triangular matrix; because the eigenvalues of lower-triangular matrices are explicit, a careful argument allows us to reduce this to Lemma 3. All the proofs are in Section B.
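
As a small illustration of what "permutation up to sign" means in Lemma 2, the sketch below uses the classical three-shear identity to write the map that swaps the two halves of the input (with one sign flip) as a product of three linear affine coupling matrices. This is a single easy case, not the general construction from the proof, and the helper names `lower` and `upper` are our own.

```python
import numpy as np

def lower(A, B_diag):
    """Lower triangular linear coupling matrix [[I, 0], [A, diag(B_diag)]]."""
    d = A.shape[0]
    return np.block([[np.eye(d), np.zeros((d, d))], [A, np.diag(B_diag)]])

def upper(C_diag, D):
    """Upper triangular linear coupling matrix [[diag(C_diag), D], [0, I]]."""
    d = D.shape[0]
    return np.block([[np.diag(C_diag), D], [np.zeros((d, d)), np.eye(d)]])

d = 4
I, ones = np.eye(d), np.ones(d)

# Three shears with identity diagonal blocks.
product = upper(ones, -I) @ lower(I, ones) @ upper(ones, -I)

# Target: (x_1, x_2) -> (-x_2, x_1), i.e. the swap permutation up to sign.
signed_swap = np.block([[np.zeros((d, d)), -I], [I, np.zeros((d, d))]])
swap = np.block([[np.zeros((d, d)), I], [I, np.zeros((d, d))]])
```

The sign flip is unavoidable here in general: every product of coupling matrices has positive determinant, while a permutation matrix may have determinant $-1$.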

5.2. LOWER BOUND

We proceed to the lower bound. Note that a simple parameter counting argument shows that for sufficiently large $d$, at least four affine coupling layers are needed to implement an arbitrary linear map (each affine coupling layer has only $d^2 + d$ parameters, whereas $GL_+(2d, \mathbb{R})$ is a Lie group of dimension $4d^2$). Perhaps surprisingly, it turns out that four affine coupling layers do not suffice to construct an arbitrary linear map. We prove this in the following theorem.

Theorem 6 (Restatement of Theorem 3). For $d \ge 4$, $\mathcal{A}^4$ is a proper subset of $GL_+(2d, \mathbb{R})$. In other words, there exists a matrix $T \in GL_+(2d, \mathbb{R})$ which is not in $\mathcal{A}^4$.

Again, this translates to the result in Theorem 3 because each layer corresponds to two matrices, so this shows two layers are not enough to get arbitrary matrices. The key observation is that matrices in $\mathcal{A}_L \mathcal{A}_U \mathcal{A}_L \mathcal{A}_U$ satisfy a strong algebraic invariant which is not true of arbitrary matrices. This invariant can be expressed in terms of the Schur complement (Zhang, 2006):

Lemma 4. Suppose that $T = \begin{pmatrix} X & Y \\ Z & W \end{pmatrix}$ is an invertible $2d \times 2d$ matrix and suppose there exist matrices $A, E \in \mathbb{R}^{d \times d}$, $D, H \in \mathbb{R}^{d \times d}$ and diagonal matrices $B, F \in \mathbb{R}^{d \times d}$, $C, G \in \mathbb{R}^{d \times d}$ such that
$$T = \begin{pmatrix} I & 0 \\ A & B \end{pmatrix} \begin{pmatrix} C & D \\ 0 & I \end{pmatrix} \begin{pmatrix} I & 0 \\ E & F \end{pmatrix} \begin{pmatrix} G & H \\ 0 & I \end{pmatrix}.$$
Then the Schur complement $T/X := W - Z X^{-1} Y$ is similar to $X^{-1} C$: more precisely, if $U = Z - AX$ then $T/X = U X^{-1} C U^{-1}$.

The proof of this lemma is presented in Appendix B, as well as the resulting proof of Theorem 6. We remark that the argument in the proof is actually fairly general; it can be shown, for example, that for a random choice of $X$ and $W$ from the Ginibre ensemble, $T$ typically cannot be expressed as an element of $\mathcal{A}^4$. So there are significant restrictions on what matrices can be expressed even with four affine coupling layers.

Remark 7 (Connection to Universal Approximation). As mentioned earlier, this lower bound shows that the map computed by general 4-layer affine coupling networks is quite restricted in its local behavior (its Jacobian cannot be arbitrary). This implies that smooth 4-layer affine coupling networks, where smooth means the Hessian (of each coordinate of the output) is bounded in spectral norm, cannot be universal function approximators, as they cannot even approximate some linear maps. In contrast, if we allow the computed function to be very jagged then three layers are universal (see Theorem 4).
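
The parameter counting argument at the start of this subsection can be made explicit with a few lines of arithmetic. The helper below, whose name is our own, returns the smallest number of coupling matrices whose total parameter count $k(d^2 + d)$ reaches the dimension $4d^2$ of $GL_+(2d, \mathbb{R})$.

```python
def min_matrices_by_counting(d):
    """Smallest k with k * (d^2 + d) >= 4 * d^2, using exact integer arithmetic."""
    per_matrix = d * d + d          # d^2 entries of A (or D) plus d diagonal entries
    target = 4 * d * d              # dimension of GL_+(2d, R)
    return -(-target // per_matrix)  # ceiling division

counts = {d: min_matrices_by_counting(d) for d in (1, 2, 3, 4, 16, 1000)}
```

Since $4d^2/(d^2+d) = 4d/(d+1) < 4$ for every $d$, counting alone never demands more than four matrices, and it demands exactly four from $d = 4$ onward. This is why the failure of four matrices in Theorem 6 cannot be explained by counting.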

5.3. EXPERIMENTAL RESULTS

We also verify the bounds from this section empirically. At least on randomly chosen matrices, the correct bound is closer to the lower bound. Precisely, we generate (synthetic) training data of the form $Az$, where $z \sim N(0, I)$, for a fixed $d \times d$ square matrix $A$ with random standard Gaussian entries, and train a linear affine coupling network $f_n$ with $n = 1, 2, 4, 8, 16$ layers by minimizing the loss $\mathbb{E}_{z \sim N(0, I)} \|f_n(z) - Az\|^2$. We train with this "supervised" regression loss instead of the standard unsupervised likelihood loss to minimize algorithmic (training) effects, as the theorems focus on the representational aspects. The results for $d = 16$ are shown in Figure 1, and more details are in Section C. To test a distribution other than the Gaussian ensemble, we also generated random Toeplitz matrices with constant diagonals by sampling the value for each diagonal from a standard Gaussian, and performed the same regression experiments. We found the same dependence on the number of layers but an overall higher error, suggesting that this distribution is slightly "harder". We provide results in Section C. We also regress a nonlinear RealNVP architecture on the same problems and see a similar increase in representational power, though the nonlinear models seem to require more training to reach good performance.

Additional Remarks. Finally, we also note that there are some surprisingly simple functions that cannot be exactly implemented by a finite affine coupling network. For instance, an entrywise tanh function (i.e. an entrywise nonlinearity) cannot be exactly represented by any finite affine coupling network, regardless of the nonlinearity used. Details of this are in Appendix E.
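
For a linear network, the regression loss above has a closed form. Writing $F$ for the matrix computed by the linear affine coupling network (a symbol we introduce here for illustration), and using $\mathbb{E}[z z^\top] = I$ for $z \sim N(0, I)$:

```latex
\mathbb{E}_{z \sim N(0, I)} \bigl\| F z - A z \bigr\|^2
  = \mathbb{E}\Bigl[ \operatorname{tr}\bigl( (F - A)^\top (F - A)\, z z^\top \bigr) \Bigr]
  = \operatorname{tr}\bigl( (F - A)^\top (F - A) \bigr)
  = \| F - A \|_F^2 .
```

So minimizing this loss is the same as minimizing the squared Frobenius distance between the matrix of the network and the target $A$, which is the quantity a representational result about linear maps is concerned with.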

6. PROOF SKETCH OF THEOREM 4: UNIVERSAL APPROXIMATION WITH ILL-CONDITIONED AFFINE COUPLING NETWORKS

In this section, we sketch the proof of Theorem 4 to show how to approximate a distribution in $\mathbb{R}^n$ using three layers of affine coupling networks, where the dimension $n = 2d$ is even. The partition in the affine coupling network is between the first $d$ coordinates and the second $d$ coordinates in $\mathbb{R}^{2d}$. The first element in the proof is a well-known theorem from optimal transport called Brenier's theorem, which states that for $Q$ a probability measure over $\mathbb{R}^n$ satisfying weak regularity conditions (see Theorem 9 in Section D), there exists a map $\varphi : \mathbb{R}^n \to \mathbb{R}^n$ such that if $X \sim N(0, I_{n \times n})$, then the pushforward $\varphi_\#(X)$ is distributed according to $Q$.

The proof then proceeds by using a lattice-based encoding and decoding scheme. Concretely, let $\epsilon > 0$ be a small constant, to be taken sufficiently small. Let $\epsilon' \in (0, \epsilon)$ be a further constant, taken sufficiently small with respect to $\epsilon$, and similarly $\epsilon''$ with respect to $\epsilon'$. Let the input to the affine coupling network be $X = (X_1, X_2)$ such that $X_1 \sim N(0, I_{d \times d})$ and $X_2 \sim N(0, I_{d \times d})$. Let $f(x)$ be the map which rounds $x \in \mathbb{R}^d$ to the closest grid point in the lattice $\epsilon\mathbb{Z}^d$ and define $g(x) = x - f(x)$. Note that for a point of the form $z = f(x) + y$ for $y$ which is not too large, we have that $f(z) = f(x)$ and $g(z) = y$. Suppose the optimal transportation map from Brenier's Theorem is $\varphi(x) = (\varphi_1(x), \varphi_2(x))$ where $\varphi_1, \varphi_2 : \mathbb{R}^{2d} \to \mathbb{R}^d$ correspond to the two halves of the output. Now we consider the following sequence of maps, each of which forms an affine coupling layer:

$$(X_1, X_2) \mapsto (X_1,\ \epsilon' X_2 + f(X_1)) \quad (1)$$
$$\mapsto \big(f(\varphi_1(f(X_1), X_2)) + \epsilon' \varphi_2(f(X_1), X_2) + O(\epsilon''),\ \epsilon' X_2 + f(X_1)\big) \quad (2)$$
$$\mapsto \big(f(\varphi_1(f(X_1), X_2)) + \epsilon' \varphi_2(f(X_1), X_2) + O(\epsilon''),\ \varphi_2(f(X_1), X_2) + O(\epsilon''/\epsilon')\big). \quad (3)$$

To explicitly see why the above are affine coupling layers, in the first step we take $s_1(x) = \log(\epsilon')\vec{1}$ and $t_1(x) = f(x)$. In the second step, we take $s_2(x) = \log(\epsilon'')\vec{1}$ and $t_2$ is defined by $t_2(x) = f(\varphi_1(f(x), g(x)/\epsilon')) + \epsilon' \varphi_2(f(x), g(x)/\epsilon')$. In the third step, we take $s_3(x) = \log(\epsilon'')\vec{1}$ and define $t_3(x) = g(x)/\epsilon'$. Taking sufficiently good approximations to all of the maps allows us to approximate this map with neural networks, which we formalize in Appendix D.

Figure 1: Fitting 32-dimensional linear maps using n-layer linear affine coupling networks. The squared Frobenius error is normalized by $1/d^2$ so it is independent of dimensionality. We shade the standard error regions of these losses across the seeds tried.
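The encoding and decoding steps can be traced numerically in the smallest case $d = 1$. The following is a minimal sketch, not the construction used in the formal proof: the constants `EPS`, `EPS1`, `EPS2` (standing for a lattice spacing and two successively smaller scalings) and the maps `phi1`, `phi2` are hypothetical choices made only for illustration, and the assignment of scalings to layers is an assumption chosen so that each step decodes correctly.

```python
import math

# Hypothetical constants: lattice spacing and two successively smaller scalings.
EPS = 0.1
EPS1 = 1e-3
EPS2 = 1e-7


def f(x):
    """Round x to the nearest point of the lattice EPS * Z."""
    return EPS * round(x / EPS)


def g(x):
    """Residual of x relative to its lattice point."""
    return x - f(x)


# Hypothetical stand-ins for the two halves of a transport map.
def phi1(a, b):
    return math.sin(a) + b


def phi2(a, b):
    return math.cos(a) * b


def coupling(kept, changed, log_scale, shift):
    """Affine coupling: the kept half is unchanged, the other half is
    rescaled by exp(log_scale) and shifted by shift(kept)."""
    return changed * math.exp(log_scale) + shift(kept)


def three_layers(x1, x2):
    # Layer 1: store the rounded x1 in the coarse digits of the second half.
    x2 = coupling(x1, x2, math.log(EPS1), f)

    # Layer 2: decode (f(x1), x2) from the second half, write both outputs
    # into the first half at two different scales.
    def t2(u):
        a, b = f(u), g(u) / EPS1
        return f(phi1(a, b)) + EPS1 * phi2(a, b)

    x1 = coupling(x2, x1, math.log(EPS2), t2)

    # Layer 3: read the fine-scale part of the first half back out.
    x2 = coupling(x1, x2, math.log(EPS2), lambda v: g(v) / EPS1)
    return x1, x2


out1, out2 = three_layers(0.337, -0.8)
```

In this sketch the second output agrees with `phi2(f(0.337), -0.8)` up to an error of order `EPS2 / EPS1`, and rounding the first output recovers `f(phi1(f(0.337), -0.8))`. The very small scalings are what make the resulting map ill-conditioned.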

6.1. EXPERIMENTAL RESULTS

On the empirical side, we explore the effect that different types of padding have on training on various synthetic datasets. For Gaussian padding, this means we append to each d-dimensional training data point an additional d dimensions sampled from $N(0, I_d)$. We consistently observe that zero padding has the worst performance and Gaussian padding has the best performance. In Figure 2 we show the performance of a simple RealNVP architecture trained via maximum likelihood on a mixture of 4 Gaussians, and we plot the condition number of the Jacobian during training for each padding method. The latter supports the claim that poor conditioning is a major reason why zero padding performs so badly. In Appendix C.2 we provide figures from more synthetic datasets.
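The padding schemes compared here can be sketched as a small preprocessing step. This is an illustrative sketch only: the function name `pad_points` and the mode strings are hypothetical, and it assumes zero padding appends the same number of extra coordinates as Gaussian padding.

```python
import random


def pad_points(points, mode, rng):
    """Append d extra coordinates to each d-dimensional point.

    mode is one of "none", "zero", "gaussian" (hypothetical labels).
    """
    padded = []
    for point in points:
        d = len(point)
        if mode == "none":
            extra = []
        elif mode == "zero":
            extra = [0.0] * d
        elif mode == "gaussian":
            # d fresh samples from a standard Gaussian, as in N(0, I_d).
            extra = [rng.gauss(0.0, 1.0) for _ in range(d)]
        else:
            raise ValueError(f"unknown padding mode: {mode}")
        padded.append(list(point) + extra)
    return padded


rng = random.Random(0)
data = [[1.0, 2.0], [3.0, 4.0]]
zero_padded = pad_points(data, "zero", rng)
gaussian_padded = pad_points(data, "gaussian", rng)
```

With zero padding the data lies on a d-dimensional subspace of the 2d-dimensional input space, whereas Gaussian padding gives the padded distribution full-dimensional support.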

7. CONCLUSION

Normalizing flows are one of the most heavily used generative models across various domains, though we still have a relatively narrow understanding of their relative pros and cons compared to other models. In this paper, we tackled representational aspects of two issues that are frequent sources of training difficulties, depth and conditioning. We hope this work will inspire more theoretical study of fine-grained properties of different generative models.

A MISSING PROOFS FOR SECTION 4

A.1 WASSERSTEIN DISTANCE FOR MIXTURES

Lemma 5. Let $\mu$ and $\nu$ be two mixtures of $k$ spherical Gaussians in $d$ dimensions with mixing weights $1/k$, means $(\mu_1, \mu_2, \dots, \mu_k)$ and $(\nu_1, \nu_2, \dots, \nu_k)$ respectively, and with all of the Gaussians having spherical covariance matrix $\gamma^2 I$ for some $\gamma > 0$. Suppose that there exists a set $S \subseteq [k]$ with $|S| \ge k/10$ such that for every $i \in S$,
$$\min_{1 \le j \le k} \|\mu_i - \nu_j\|^2 \ge 20\gamma^2 d.$$
Then $W_1(\mu, \nu) = \Omega(\gamma\sqrt{d})$.

Proof. By the dual formulation of Wasserstein distance (Kantorovich-Rubinstein Theorem, Villani (2003)), we have $W_1(\mu, \nu) = \sup_\varphi \left[\int \varphi\, d\mu - \int \varphi\, d\nu\right]$ where the supremum is taken over all 1-Lipschitz functions $\varphi$. Towards lower bounding this, consider $\varphi(x) = \max(0,\ 2\gamma\sqrt{d} - \min_{i \in S} \|x - \mu_i\|)$ and note that this function is 1-Lipschitz and always valued in $[0, 2\gamma\sqrt{d}]$. For a single Gaussian $Z \sim N(0, \gamma^2 I_{d \times d})$, observe that
$$\mathbb{E}_{Z}\left[\max(0,\ 2\gamma\sqrt{d} - \|Z\|)\right] \ge 2\gamma\sqrt{d} - \mathbb{E}[\|Z\|] \ge 2\gamma\sqrt{d} - \sqrt{\mathbb{E}[\|Z\|^2]} \ge \gamma\sqrt{d}.$$
Therefore, we see that $\int \varphi\, d\mu = \Omega(\gamma\sqrt{d})$ by combining the above calculation with the fact that at least $1/10$ of the centers for $\mu$ are in $S$. On the other hand, for $Z \sim N(0, \gamma^2 I_{d \times d})$ we have $\Pr(\|Z\|^2 \ge 10\gamma^2 d) \le 2e^{-10d}$ (e.g. by Bernstein's inequality, Vershynin (2018), as $\|Z\|^2$ is a sum of squares of Gaussians, i.e. a $\chi^2$ random variable). In particular, since the points in $S$ do not have a close point in $\{\nu_i\}_{i=1}^k$, we similarly have $\int \varphi\, d\nu = O(e^{-10d}\gamma\sqrt{d}) = o(\gamma\sqrt{d})$, since very little mass from each Gaussian in $\nu$ lands in the support of $\varphi$ by the separation assumption. Combining the bounds gives the result.
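The single-Gaussian bound in the proof of Lemma 5 can be sanity-checked by simulation. The sketch below is only a Monte Carlo illustration under assumed values ($d = 16$, $\gamma = 1$, a fixed seed); the function name is hypothetical and the check is no substitute for the Jensen-type argument in the proof.

```python
import math
import random


def mean_truncated_gap(d, gamma, num_samples, seed=0):
    """Monte Carlo estimate of E[max(0, 2*gamma*sqrt(d) - ||Z||)]
    for Z ~ N(0, gamma^2 I_d)."""
    rng = random.Random(seed)
    radius = 2.0 * gamma * math.sqrt(d)
    total = 0.0
    for _ in range(num_samples):
        norm = math.sqrt(sum(rng.gauss(0.0, gamma) ** 2 for _ in range(d)))
        total += max(0.0, radius - norm)
    return total / num_samples


d, gamma = 16, 1.0
estimate = mean_truncated_gap(d, gamma, num_samples=20000)
lower_bound = gamma * math.sqrt(d)  # the bound gamma * sqrt(d) from the proof
```

For these values the estimate should sit slightly above `lower_bound`, reflecting that $\mathbb{E}\|Z\|$ is slightly smaller than $\gamma\sqrt{d}$.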

A.2 CONSTRUCTING TUPLES OF WELL-SEPARATED MEANS

First, by elementary Chernoff bounds, we have the following result:

Lemma 6 (Large family of well-separated points). Let $\epsilon > 0$. There exists a set $\{v_1, v_2, \dots, v_N\}$ of vectors $v_i \in \mathbb{R}^d$, $\|v_i\| = 1$, with $N = \exp(d\epsilon^2/4)$, s.t. $\|v_i - v_j\|^2 \ge 2(1 - \epsilon)$ for all $i \ne j$.

Proof. Recall that for a random unit vector $v$ on the sphere in $d$ dimensions, $\Pr(v_i > t/\sqrt{d}) \le e^{-t^2/2}$. (This is a basic fact about spherical caps, see e.g. Rao (2011).) By spherical symmetry and the union bound, this means for two unit vectors $v, w$ sampled uniformly at random, $\Pr(|\langle v, w \rangle| > t/\sqrt{d}) \le 2e^{-t^2/2}$. Taking $t = \epsilon\sqrt{d}$ gives that the probability is at most $2e^{-d\epsilon^2/2}$; therefore if we draw $N$ i.i.d. vectors, the probability that some two of them have inner product larger than $\epsilon$ in absolute value is at most $\binom{N}{2} \cdot 2e^{-d\epsilon^2/2} < N^2 e^{-d\epsilon^2/2} = 1$ if $N = e^{d\epsilon^2/4}$, which in particular implies such a collection of vectors exists, since $\|v_i - v_j\|^2 = 2 - 2\langle v_i, v_j \rangle$ for unit vectors.

To construct tuples with small intersections, we use the following result by Rödl:

Lemma 7 (Rödl & Thoma (1996)). There exists a set consisting of $\left(\frac{N}{2k}\right)^{k/10}$ subsets of size $k$ of $[N]$, s.t. no pair of subsets intersect in more than $k/10$ elements.

A.3 EPSILON-NET COUNT

The following lemma is immediate:

Lemma 8. Suppose that $\Theta \subset \mathbb{R}^d$ is contained in a ball of radius $R > 0$ and $f_\theta$ is a family of invertible layerwise networks which is $L$-Lipschitz with respect to its parameters. Then there exists a set of neural networks $S_\epsilon = \{f_i\}$, s.t. $|S_\epsilon| = O\left((LR/\epsilon)^d\right)$ and for every $\theta \in \Theta$ there exists an $f_i \in S_\epsilon$, s.t. $\mathbb{E}_{x \sim N(0, I_{d \times d})} \|f_\theta(x) - f_i(x)\|_\infty \le \epsilon$.

A.4 SIMULATING A MIXTURE WITH A NEURAL NETWORK

Lemma 9. Let $p : \mathbb{R}^d \to \mathbb{R}_+$ be a mixture of $k$ Gaussians with means $\{\mu_i\}_{i=1}^k$, $\|\mu_i\|^2 = 20\gamma^2 d$, and covariance $\gamma^2 I_d$. Then, $\forall \epsilon > 0$, we have $W_1(p, g_{\#N}) \le \epsilon$ for a neural network $g$ with $O(k)$ parameters (footnote 5). Moreover, for every 1-Lipschitz $\phi : \mathbb{R}^d \to \mathbb{R}_+$ and $X \sim g_{\#N}$, $\phi(X)$ is $O(\gamma^2 d)$-subgaussian.

Proof. We will use a construction similar to Arora et al. (2017). Since the latent variable dimension is $d + 1$, the idea is to use the first variable, say $h$, as input to a "selector" circuit which picks one of the components of the mixture with approximately the right probability, then use the remaining dimensions, say the variable $z$, to output a sample from the appropriate component. For notational convenience, let $M = 20\gamma^2 d$. Let $\{h_i\}_{i=1}^{k-1}$ be real values that partition $\mathbb{R}$ into $k$ intervals that have equal probability under the Gaussian measure. Then, the map
$$f(h, z) = \gamma z + \sum_{i=1}^k \mathbb{1}(h \in (h_{i-1}, h_i])\,\mu_i \quad (4)$$
exactly generates the desired mixture, where $h_0$ is understood to be $-\infty$ and $h_k = +\infty$.

To construct $g$, first we approximate the indicators using two ReLUs, s.t. we design for each interval $(h_{i-1}, h_i]$ a function $\tilde{\mathbb{1}}_i$, s.t.:
(1) $\tilde{\mathbb{1}}_i(h) = \mathbb{1}(h \in (h_{i-1}, h_i])$ unless $h \in [h_{i-1}, h_{i-1} + \delta^+_{i-1}] \cup [h_i - \delta^-_i, h_i]$, and the Gaussian measure of the union of the above two intervals is $\delta$.
(2) $\sum_i \tilde{\mathbb{1}}_i(h) = 1$.
The constructions of the functions $\tilde{\mathbb{1}}_i$ above can be found in Arora et al. (2017), Lemma 3. We subsequently construct the neural network $\tilde{f}(h, z)$ using ReLUs, defined as
$$\tilde{f}(h, z) = \gamma z + \sum_{i=1}^k \Big[\mathrm{ReLU}\big(-M(1 - \tilde{\mathbb{1}}_i(h)) + \mu_i\big) - \mathrm{ReLU}\big(-M(1 - \tilde{\mathbb{1}}_i(h)) - \mu_i\big)\Big]. \quad (5)$$
Denoting $B := \bigcup_{i=1}^{k-1} [h_i - \delta^-_i, h_i + \delta^+_i]$, note that if $h \notin B$, then for all $z$, $\tilde{f}(h, z) = f(h, z)$, as desired. If $h \in [h_i - \delta^-_i, h_i + \delta^+_i]$, $\tilde{f}(h, z)$ by construction will be $\gamma z + \sum_{i=1}^k w_i(h)\mu_i$ for some $w_i(h) \in [0, 1]$ s.t. $\sum_i w_i(h) = 1$.
Denoting by $\phi(h, z)$ the joint pdf of $h, z$, by the coupling definition of $W_1$, we have
$$W_1(\tilde{f}_{\#N}, \mu) \le \int_{h \in \mathbb{R}, z \in \mathbb{R}^d} \|f(h, z) - \tilde{f}(h, z)\|_1 \, d\phi(h, z)$$
$$= \int_{h \in \mathbb{R}} \Big\| \sum_{i=1}^k \mathbb{1}(h \in (h_{i-1}, h_i])\mu_i - \sum_{i=1}^k \Big[\mathrm{ReLU}\big(-M(1 - \tilde{\mathbb{1}}_i(h)) + \mu_i\big) - \mathrm{ReLU}\big(-M(1 - \tilde{\mathbb{1}}_i(h)) - \mu_i\big)\Big] \Big\|_1 d\phi(h)$$
$$= \int_{h \in B} \Big\| \sum_{i=1}^k \mathbb{1}(h \in (h_{i-1}, h_i])\mu_i - \sum_i w_i(h)\mu_i \Big\|_1 d\phi(h)$$
$$\le \int_{h \in B} \max_{i,j} \|\mu_i - \mu_j\|_1 \, d\phi(h) = \int_{h \in B} 2M\sqrt{d}\, d\phi(h) = 2M\sqrt{d}\, \Pr[h \in B] = 2M\sqrt{d}\, k\delta.$$

taking $\sigma_1 = \prod_{i=1}^k \sigma_{i1}$ and $\sigma_2 = \prod_{i=1}^k \sigma_{i2}$ proves the desired result since $\pi = \sigma_1\sigma_2$ and $\sigma_1, \sigma_2$ are both of order at most 2. It remains to prove the result for a single cycle $c$ of length $r$. The cases $r \le 2$ are trivial. Without loss of generality, we assume $c = (1 \cdots r)$. Let $\sigma_1(1) = 2$, $\sigma_1(2) = 1$, and otherwise $\sigma_1(s) = r + 3 - s$. Let $\sigma_2(1) = 3$, $\sigma_2(2) = 2$, $\sigma_2(3) = 1$, and otherwise $\sigma_2(s) = r + 4 - s$. It is easy to check from the definition that both of these elements have order at most 2. We now claim $c = \sigma_2 \circ \sigma_1$. To see this, we consider the following cases:
1. $\sigma_2(\sigma_1(1)) = \sigma_2(2) = 2$.
2. $\sigma_2(\sigma_1(2)) = \sigma_2(1) = 3$.
3. $\sigma_2(\sigma_1(r)) = \sigma_2(3) = 1$.
4. For all other $s$, $\sigma_2(\sigma_1(s)) = \sigma_2(r + 3 - s) = s + 1$.
In all cases we see that $c(s) = \sigma_2(\sigma_1(s))$, which proves the result.

Next, we supply the proof of Lemma 2.

Proof of Lemma 2. It is easy to see that swapping two elements is possible in a fashion that doesn't affect other dimensions by the following 'signed swap' procedure requiring 3 matrices:
$$(x, y) \mapsto (x, y - x) \mapsto (y, y - x) \mapsto (y, -x). \quad (6)$$
Next, let $L = \{1, \dots, d\}$ and $R = \{d + 1, \dots, 2d\}$. The number of elements which a particular permutation moves from $L$ to $R$ equals the number it moves from $R$ to $L$. We can choose an arbitrary bijection between the two sets of elements and perform these 'signed swaps' in parallel as they are disjoint, using a total of 3 matrices. After this step, each of $L$ and $R$ contains exactly the elements that need to be mapped there.
We can also (up to sign) transpose elements within a given set $L$ or $R$ via the following computation using our previous 'signed swaps' that requires one 'storage component' in the other set:
$$([x, y], z) \mapsto ([z, y], -x) \mapsto ([z, x], y) \mapsto ([y, x], -z).$$
So, up to sign, we can in 9 matrices compute any transposition in $L$ or $R$ separately. In fact, since any permutation can be represented as the product of two order-2 permutations (Lemma 10) and any order-2 permutation is a disjoint union of transpositions, we can implement an order-2 permutation up to sign using 9 matrices and an arbitrary permutation up to sign using 18 matrices. In total, we used 3 matrices to move elements to the correct side and 18 matrices to move them to their correct position, for a total of 21 matrices.

Lemma 11. Suppose $A \in \mathbb{R}^{n \times n}$ is a matrix with $n$ distinct real eigenvalues. Then there exists an invertible matrix $S \in \mathbb{R}^{n \times n}$ such that $A = SDS^{-1}$ where $D$ is a diagonal matrix containing the eigenvalues of $A$.

Proof. Observe that for every eigenvalue $\lambda_i$ of $A$, the matrix $(A - \lambda_i I)$ has rank $n - 1$ by definition, hence there exists a corresponding real eigenvector $v_i$ by taking a nonzero solution of the real linear system $(A - \lambda_i I)v = 0$. Taking $S$ to be the linear operator which maps the standard basis vector $e_i$ to $v_i$, and $D = \mathrm{diag}(\lambda_1, \dots, \lambda_n)$, proves the result.

Next, we give the proof of Lemma 3.

Proof of Lemma 3. Let $D = (M - I)E^{-1}$, $H = (M^{-1} - I)E^{-1}$, $E = -AM$, where $A$ is an invertible matrix that will be specified later. We can multiply out with these values, giving
$$\begin{pmatrix} I & 0 \\ A & I \end{pmatrix}\begin{pmatrix} I & D \\ 0 & I \end{pmatrix}\begin{pmatrix} I & 0 \\ E & I \end{pmatrix}\begin{pmatrix} I & H \\ 0 & I \end{pmatrix}$$
$$= \begin{pmatrix} I & 0 \\ A & I \end{pmatrix}\begin{pmatrix} I & (I - M)M^{-1}A^{-1} \\ 0 & I \end{pmatrix}\begin{pmatrix} I & 0 \\ -AM & I \end{pmatrix}\begin{pmatrix} I & (I - M^{-1})M^{-1}A^{-1} \\ 0 & I \end{pmatrix}$$
$$= \begin{pmatrix} I & (M^{-1} - I)A^{-1} \\ A & AM^{-1}A^{-1} \end{pmatrix}\begin{pmatrix} I & 0 \\ -AM & I \end{pmatrix}\begin{pmatrix} I & (I - M^{-1})M^{-1}A^{-1} \\ 0 & I \end{pmatrix}$$
$$= \begin{pmatrix} M & (M^{-1} - I)A^{-1} \\ 0 & AM^{-1}A^{-1} \end{pmatrix}\begin{pmatrix} I & (I - M^{-1})M^{-1}A^{-1} \\ 0 & I \end{pmatrix} = \begin{pmatrix} M & 0 \\ 0 & AM^{-1}A^{-1} \end{pmatrix}.$$
Here what remains is to guarantee $AM^{-1}A^{-1} = S$.
Since $S$ and $M^{-1}$ have the same eigenvalues, by Lemma 11 there exist real matrices $U, V$ such that $S = UXU^{-1}$ and $M^{-1} = VXV^{-1}$ for the same diagonal matrix $X$, hence $S = UV^{-1}M^{-1}VU^{-1}$. Therefore taking $A = UV^{-1}$ gives the result.

Now that we have the lemmas, we prove the upper bound.

Proof of Theorem 5. Recall that our goal is to show that $GL_+(2d, \mathbb{R}) \subset \mathcal{A}^K$ for an absolute constant $K > 0$. To show this, we consider an arbitrary matrix $T \in GL_+(2d, \mathbb{R})$, i.e. an arbitrary $2d \times 2d$ matrix $T$ with positive determinant, and show how to build it as a product of a bounded number of elements from $\mathcal{A}$. As $T$ is a square matrix, it admits an LUP decomposition (Horn & Johnson (2012)): i.e. a decomposition into the product of a lower triangular matrix $L$, an upper triangular matrix $U$, and a permutation matrix $P$. This proof proceeds essentially by showing how to construct the $L$, $U$, and $P$ components in a constant number of our desired matrices.

By Lemma 2, we can produce a matrix $\tilde{P}$ with $\det \tilde{P} > 0$ which agrees with $P$ up to the sign of its entries using $O(1)$ linear affine coupling layers. Then $T\tilde{P}^{-1}$ is a matrix which admits an LU decomposition: for example, given that we know $TP^{-1}$ has an LU decomposition, we can flip the sign of some entries of $U$ to get an LU decomposition of $T\tilde{P}^{-1}$. Furthermore, since $\det(T\tilde{P}^{-1}) > 0$, we can choose an LU decomposition $T\tilde{P}^{-1} = LU$ such that $\det(L), \det(U) > 0$. (For any decomposition which does not satisfy this, the two matrices $L$ and $U$ must both have negative determinant, as $0 < \det(T\tilde{P}^{-1}) = \det(L)\det(U)$. In this case, we can flip the sign of column $i$ in $L$ and row $i$ in $U$ to make the two matrices have positive determinant.) It remains to show how to construct a lower/upper triangular matrix with positive determinant out of our matrices. We show how to build such a lower triangular matrix $L$, as building $U$ is symmetrical.

At this point we have a matrix $\begin{pmatrix} A & 0 \\ B & C \end{pmatrix}$, where $A$ and $C$ are lower triangular. We can use column elimination to eliminate the bottom-left block:
$$\begin{pmatrix} A & 0 \\ B & C \end{pmatrix}\begin{pmatrix} I & 0 \\ -C^{-1}B & I \end{pmatrix} = \begin{pmatrix} A & 0 \\ 0 & C \end{pmatrix},$$
where $A$ and $C$ are lower triangular.

Recall from equation 6 that we can perform the signed swap operation in $\mathbb{R}^2$ of taking $(x, y) \mapsto (y, -x)$ using 3 affine coupling blocks. Therefore using 6 affine coupling blocks we can perform a sign flip map $(x, y) \mapsto (-x, -y)$. Note that because $\det(L) > 0$, the number of negative entries in the first $d$ diagonal entries has the same parity as the number of negative entries in the second $d$ diagonal entries. Therefore, using these sign flips in parallel, we can ensure using 6 affine coupling layers that the first $d$ and last $d$ diagonal entries of $L$ have the same number of negative elements. Now that the number of negative entries match, we can apply two diagonal rescalings to ensure that:
1. The first $d$ diagonal entries of the matrix are distinct.
2. The last $d$ diagonal entries contain the multiplicative inverses of the first $d$ entries up to reordering. Here we use that the number of negative elements in the first $d$ and last $d$ elements are the same, which we ensured earlier.
At this point, we can apply Lemma 3 to construct this matrix from four of our desired matrices. Since this shows we can build $L$ and $U$, this shows we can build any matrix with positive determinant.

Now, let's count the matrices we needed to accomplish this. In order to construct $\tilde{P}$, we needed 21 matrices. To construct $L$, we needed 1 for column elimination, 6 for the sign flip, 2 for the rescaling of diagonal elements, and 4 for Lemma 3, giving a total of 13. So, we need 21 + 13 + 13 = 47 total matrices to construct the whole LUP decomposition.
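Two of the building blocks counted above can be checked mechanically on small instances: the signed swap as a product of three shear matrices (and the sign flip as two signed swaps), and the factorization of a cycle $(1 \cdots r)$ into two permutations of order at most 2. The sketch below uses only plain Python lists and dictionaries; the helper names are hypothetical and the check covers small $r$ only.

```python
def matmul2(a, b):
    """Product of two 2x2 integer matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


LOWER = [[1, 0], [-1, 1]]  # (x, y) -> (x, y - x)
UPPER = [[1, 1], [0, 1]]   # (x, y) -> (x + y, y)

# Apply LOWER, then UPPER, then LOWER: (x, y) -> (y, -x).
SIGNED_SWAP = matmul2(LOWER, matmul2(UPPER, LOWER))
# Two signed swaps (6 shears in total): (x, y) -> (-x, -y).
SIGN_FLIP = matmul2(SIGNED_SWAP, SIGNED_SWAP)


def cycle_as_two_involutions(r):
    """Return 1-indexed dicts (sigma1, sigma2) with
    sigma2(sigma1(s)) = c(s) for the cycle c = (1 2 ... r), r >= 3."""
    sigma1 = {s: r + 3 - s for s in range(3, r + 1)}
    sigma1.update({1: 2, 2: 1})
    sigma2 = {s: r + 4 - s for s in range(4, r + 1)}
    sigma2.update({1: 3, 2: 2, 3: 1})
    return sigma1, sigma2


def is_involution(sigma):
    """True if applying sigma twice gives the identity."""
    return all(sigma[sigma[s]] == s for s in sigma)


def composes_to_cycle(r):
    sigma1, sigma2 = cycle_as_two_involutions(r)
    return all(sigma2[sigma1[s]] == s % r + 1 for s in range(1, r + 1))
```

Here `SIGNED_SWAP` is the rotation sending $(x, y)$ to $(y, -x)$, matching equation 6, and `composes_to_cycle(r)` confirms the four cases of the cycle argument for a given $r$.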

B.2 LOWER BOUND

Finally, we proceed to give the proof of Lemma 4.

Proof of Lemma 4. We explicitly solve the block matrix equations. Multiplying out the LHS gives
$$\begin{pmatrix} C & D \\ AC & AD + B \end{pmatrix}\begin{pmatrix} G & H \\ EG & EH + F \end{pmatrix} = \begin{pmatrix} CG + DEG & CH + DEH + DF \\ ACG + ADEG + BEG & ACH + ADEH + ADF + BEH + BF \end{pmatrix}.$$
Say $T = \begin{pmatrix} X & Y \\ Z & W \end{pmatrix}$. Starting with the top-left block gives that
$$X = (C + DE)G, \qquad D = (XG^{-1} - C)E^{-1}. \quad (7)$$
Next, the top-right block gives that
$$Y = (C + DE)H + DF = XG^{-1}H + DF, \qquad H = GX^{-1}(Y - DF). \quad (8)$$
Equivalently,
$$D = (Y - XG^{-1}H)F^{-1}. \quad (9)$$
Combining equation 8 and equation 7 gives
$$H = GX^{-1}\big(Y - (XG^{-1} - C)E^{-1}F\big), \qquad H = GX^{-1}Y - (I - GX^{-1}C)E^{-1}F. \quad (10)$$
The bottom-left block and equation 7 give
$$Z = ACG + ADEG + BEG$$
$$ZG^{-1} = AC + (AD + B)E$$
$$E = (AD + B)^{-1}(ZG^{-1} - AC) \quad (11)$$
$$E = \big(A(XG^{-1} - C)E^{-1} + B\big)^{-1}(ZG^{-1} - AC)$$
$$E^{-1} = (ZG^{-1} - AC)^{-1}\big(A(XG^{-1} - C)E^{-1} + B\big)$$
$$(ZG^{-1} - AC) = \big(A(XG^{-1} - C)E^{-1} + B\big)E = A(XG^{-1} - C) + BE$$
$$E = B^{-1}\big((ZG^{-1} - AC) - A(XG^{-1} - C)\big)$$
$$E = B^{-1}(ZG^{-1} - AXG^{-1}). \quad (12)$$
Taking the bottom-right block and substituting equation 11 gives
$$W = ACH + (AD + B)(EH + F) = ACH + (ZG^{-1} - AC)H + (AD + B)F$$
$$W = ZG^{-1}H + ADF + BF. \quad (13)$$
Substituting equation 9 into equation 13 gives
$$W = ZG^{-1}H + A(Y - XG^{-1}H) + BF = (Z - AX)G^{-1}H + AY + BF.$$
Substituting equation 10 gives
$$W = (Z - AX)G^{-1}\big(GX^{-1}Y - (I - GX^{-1}C)E^{-1}F\big) + AY + BF$$
$$= (Z - AX)\big(X^{-1}Y - (G^{-1} - X^{-1}C)E^{-1}F\big) + AY + BF.$$
Substituting equation 12 gives
$$W = (Z - AX)\big(X^{-1}Y - (G^{-1} - X^{-1}C)(ZG^{-1} - AXG^{-1})^{-1}BF\big) + AY + BF$$
$$W - ZX^{-1}Y - BF = (Z - AX)(X^{-1}C - G^{-1})\big((Z - AX)G^{-1}\big)^{-1}BF$$
$$= (Z - AX)(X^{-1}C - G^{-1})G(Z - AX)^{-1}BF$$
$$= (Z - AX)X^{-1}C(Z - AX)^{-1} - BF$$
$$W - ZX^{-1}Y = (Z - AX)X^{-1}C(Z - AX)^{-1}. \quad (14)$$
Here we notice that $W - ZX^{-1}Y$ is similar to $X^{-1}C$, where we get to choose values along the diagonal of $C$. In particular, this means that $W - ZX^{-1}Y$ and $X^{-1}C$ must have the same eigenvalues.

Proof of Theorem 6. First, note that any element in $\mathcal{A}^4$ can be written in either the form $L_1R_1L_2R_2$ or $R_1L_1R_2L_2$ for $L_1, L_2 \in \mathcal{A}_L$ and $R_1, R_2 \in \mathcal{A}_R$. We construct an explicit matrix which cannot be written in either form.
Consider an invertible matrix of the form $T = \begin{pmatrix} X & 0 \\ 0 & W \end{pmatrix}$ and observe that the Schur complement $T/X$ is simply $W$. Therefore Lemma 4 says that this matrix can only be in $\mathcal{A}_L\mathcal{A}_R\mathcal{A}_L\mathcal{A}_R$ if $W$ is similar to $X^{-1}C$ for some diagonal matrix $C$. Now consider the case where $W$ is a permutation matrix encoding the permutation $(1\ 2\ \cdots\ d)$ and $X$ is a diagonal matrix with nonzero entries. Then $X^{-1}C$ is a diagonal matrix as well, hence has real eigenvalues, while the eigenvalues of $W$ are the $d$-th roots of unity. (The latter claim follows because for any $\zeta$ with $\zeta^d = 1$, the vector $(1, \zeta, \cdots, \zeta^{d-1})$ is an eigenvector of $W$ with eigenvalue $\zeta$.) Since similar matrices must have the same eigenvalues, it is impossible that $X^{-1}C$ and $W$ are similar.

The remaining possibility we must consider is that this matrix is in $\mathcal{A}_R\mathcal{A}_L\mathcal{A}_R\mathcal{A}_L$. In this case, by applying the symmetrical version of Lemma 4 (which follows by swapping the first $n$ and last $n$ coordinates), we see that $W^{-1}C$ and $X$ must be similar. Since $\mathrm{Tr}(W^{-1}C) = 0$ and $\mathrm{Tr}(X) > 0$, this is impossible.
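The eigenvector claim for the cyclic permutation matrix is easy to verify numerically. The sketch below assumes the convention $(Wv)_i = v_{(i+1) \bmod d}$ for the matrix encoding the cycle (the transposed convention has the same set of eigenvalues); the function names are hypothetical.

```python
import cmath


def cyclic_shift(v):
    """Apply W under the assumed convention (W v)_i = v_{(i+1) mod d}."""
    return v[1:] + v[:1]


def eigen_residual(d, k):
    """Return (zeta, max_i |(W v)_i - zeta * v_i|) for zeta = exp(2 pi i k / d)
    and v = (1, zeta, ..., zeta^(d-1))."""
    zeta = cmath.exp(2j * cmath.pi * k / d)
    v = [zeta ** i for i in range(d)]
    wv = cyclic_shift(v)
    return zeta, max(abs(a - zeta * b) for a, b in zip(wv, v))


d = 5
results = [eigen_residual(d, k) for k in range(d)]
non_real = [zeta for zeta, _ in results if abs(zeta.imag) > 1e-9]
```

For $d \ge 3$ at least one of these eigenvalues is non-real, which is the property used to rule out similarity with a real diagonal matrix.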

C EXPERIMENTAL VERIFICATION

C.1 PARTITIONED LINEAR NETWORKS

In this section, we will provide empirical support for Theorems 2 and 3. More precisely, empirically, the number of required linear affine coupling layers, at least for random matrices, seems closer to the lower bound, so it is even better than the upper bound we provide.

Setup. We consider the following synthetic setup. We train $n$ linear affine coupling layers, namely networks of the form
$$f_n(z) = \prod_{i=1}^n E_i \begin{pmatrix} C_i & D_i \\ 0 & I \end{pmatrix}\begin{pmatrix} I & 0 \\ A_i & B_i \end{pmatrix} z$$
with $E_i, B_i, C_i$ diagonal. Notice the latter two follow the statement of Theorem 2, and the alternating order of upper vs lower triangular matrices can be assumed without loss of generality, as a product of upper/lower triangular matrices results in an upper/lower triangular matrix. The matrices $E_i$ turn out to be necessary for training: they enable "renormalizing" the units in the network (in fact, Glow uses these and calls them actnorm layers; in older models like RealNVP, batchnorm layers are used instead).

The training data is of the form $Az$, where $z \sim N(0, I)$ for a fixed $d \times d$ square matrix $A$ with either random standard Gaussian entries (Figures 4 to 8) or random standard Gaussian entries that are diagonal-constant (the latter giving a natural random ensemble of Toeplitz matrices; Figures 9 to 13). This ensures that there is a "ground truth" linear model that fits the data well (footnote 8). We then train the affine coupling network by minimizing the loss $\mathbb{E}_{z \sim N(0, I)} \|f_n(z) - Az\|^2$, for a variety of values of $n$ and $d$, in order to investigate how the depth of linear networks affects the ability to fit linear functions of varying dimension. Note, we are not training via maximum likelihood, but rather we are minimizing a "supervised" loss, wherein the network $f_n$ "knows" which point $x$ a latent $z$ is mapped to. This is intentional and is meant to separate the representational vs training aspect of different architectures.
Namely, this objective is easier to train, and our results address the representational aspects of different architectures of flow networks, so we wish our experiments to be confounded as little as possible by aspects of training dynamics. We chose $n = 1, 2, 4, 8, 16$ layers and $d = 4, 8, 16, 32, 64$ dimensions (here a layer is one matrix and not a flipped pair as in our theoretical results). We present the standard L2 training loss and the squared Frobenius error $\|\hat{A} - A\|_F^2$ of the recovered matrix $\hat{A}$ obtained by multiplying out the linear layers, both normalized by $1/d^2$ so that they are independent of dimensionality. We shade the standard error regions of these losses across the seeds tried. All these plots are log-scale, so the noise seen lower in the charts is very small. We initialize the $E, C, B$ matrices with 1s on the diagonal and $A, D$ with random Gaussian elements with $\sigma = 10^{-5}$, and train with Adam with learning rate $10^{-4}$. We train on 5 random seeds which affect the matrix $A$ generated and the datapoints $z$ sampled. Finally, we also train similar RealNVP models on the same datasets, using a regression objective as done with the PLNs but with $s$ and $t$ networks with two hidden layers of 128 units, and the same numbers of couplings as in the PLN experiments.
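To make the evaluation metric concrete, the following sketch multiplies out a stack of linear coupling layers and computes the normalized squared Frobenius error against a target matrix. It is an illustration in plain Python rather than the training code used for the experiments: the helper names are hypothetical, each layer is passed as a tuple `(E, C, D, A, B)` with `E`, `C`, `B` given as lists of diagonal entries, and layers are multiplied left to right in index order.

```python
def diag(values):
    """Square matrix with the given diagonal entries."""
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def zeros(m):
    return [[0.0] * m for _ in range(m)]


def block(top_left, top_right, bottom_left, bottom_right):
    """Assemble a 2x2 block matrix from four equally sized blocks."""
    top = [a + b for a, b in zip(top_left, top_right)]
    bottom = [a + b for a, b in zip(bottom_left, bottom_right)]
    return top + bottom


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def layer_matrix(E, C, D, A, B):
    """E * [[C, D], [0, I]] * [[I, 0], [A, B]] with E, C, B diagonal."""
    m = len(C)
    eye = diag([1.0] * m)
    upper = block(diag(C), D, zeros(m), eye)
    lower = block(eye, zeros(m), A, diag(B))
    return matmul(diag(E), matmul(upper, lower))


def pln_matrix(layers):
    """Multiply out all layers into the recovered matrix A_hat."""
    total = diag([1.0] * len(layers[0][0]))
    for params in layers:
        total = matmul(total, layer_matrix(*params))
    return total


def normalized_frobenius_error(a_hat, a):
    """Squared Frobenius error divided by d^2."""
    d = len(a)
    return sum((a_hat[i][j] - a[i][j]) ** 2
               for i in range(d) for j in range(d)) / d ** 2


identity_layer = ([1.0, 1.0], [1.0], [[0.0]], [[0.0]], [1.0])
example_layer = ([1.0, 1.0], [2.0], [[3.0]], [[5.0]], [7.0])
a_hat = pln_matrix([identity_layer, example_layer])
```

With the identity-style initialization described above (ones on the diagonals, off-diagonal blocks at zero) a layer multiplies out to the identity, so `a_hat` here equals the matrix of `example_layer` alone.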

Results

The results demonstrate that the 1- and 2-layer networks fail to fit even coarsely any of the linear functions we tried. Furthermore, the 4-layer networks consistently under-perform compared to the 8- and 16-layer networks. The 8- and 16-layer networks seem to perform comparably, though we note the larger mean error for d = 64, which suggests that the performance can potentially be further improved (either by adding more layers, or by improving the training through a better choice of hyperparameters; even on this synthetic setup, we found training of very deep networks to be non-trivial).

These experimental results suggest that at least for random linear transformations $T$, the number of required linear layers is closer to the lower bound. Moreover, the error for the Toeplitz ensemble is slightly larger, indicating this distribution is slightly harder. Closing this gap (both in a worst-case and distributional sense) is an interesting question for further work.

In our experiments with the RealNVP architecture, we observe more difficulty in fitting these linear maps, as they seem to need more training data to reach similar levels of error. We hypothesize this is due to the larger model class that comes with allowing nonlinear functions in the couplings.

C.2 ADDITIONAL PADDING RESULTS ON SYNTHETIC DATASETS

We provide further results on the performance of RealNVP models on datasets with different kinds of padding (no padding, zero padding and Gaussian padding) on standard synthetic datasets: Swissroll, 2 Moons and Checkerboard. The results are consistent with the performance on the mixture of 4 Gaussians: in Figures 24, 25, and 26, we see that zero padding greatly degrades the conditioning and somewhat degrades the visual quality of the learned distribution. On the other hand, Gaussian padding consistently performs best, both in terms of conditioning of the Jacobian, and in terms of the quality of the recovered distribution.

First (as a warmup), we give a much simpler proof than Huang et al. (2020) that affine coupling networks are universal approximators in Wasserstein under zero-padding, which moreover shows that only a small number of affine coupling layers are required. For $Q$ a probability measure over $\mathbb{R}^n$ satisfying weak regularity conditions (see Theorem 9 below), by Brenier's Theorem (Villani (2003)) there is a $W_2$ optimal transport map $\varphi : \mathbb{R}^n \to \mathbb{R}^n$ such that if $X \sim N(0, I_{n \times n})$, then the pushforward $\varphi_\#(X)$ is distributed according to $Q$, and a corresponding transport map in the opposite direction which we denote $\varphi^{-1}$. If we allow for arbitrary functions $t$ in the affine coupling network, then we can implement the zero-padded transport map $(X, 0) \mapsto (\varphi(X), 0)$ as follows:
$$(X, 0) \mapsto (X, \varphi(X)) \mapsto (\varphi(X), \varphi(X)) \mapsto (\varphi(X), 0). \quad (15)$$
Explicitly, in the first layer the translation map is $t_1(x) = \varphi(x)$, in the second layer the translation map is $t_2(x) = x - \varphi^{-1}(x)$, and in the third layer the translation map is $t_3(x) = -x$. Note that no scaling maps are required: with zero-padding the basic NICE architecture can be universal, unlike in the unpadded case where NICE can only hope to implement volume preserving maps. This is because every map from zero-padded data to zero-padded data is volume preserving.
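The three translation-only layers can be traced on a one-dimensional example. In the sketch below the transport map is a hypothetical stand-in, $\varphi = \sinh$ with inverse $\operatorname{asinh}$, chosen only because both directions are available in closed form; it is not derived from any particular target distribution.

```python
import math


def phi(x):
    """Hypothetical invertible transport map used for illustration."""
    return math.sinh(x)


def phi_inv(y):
    return math.asinh(y)


def zero_padded_transport(x):
    """Apply the three additive coupling layers to the padded input (x, 0)."""
    first, second = x, 0.0
    second = second + phi(first)                  # t1(x) = phi(x)
    first = first + (second - phi_inv(second))    # t2(x) = x - phi^{-1}(x)
    second = second + (-first)                    # t3(x) = -x
    return first, second


outputs = [zero_padded_transport(x) for x in (-1.5, 0.0, 0.7, 2.0)]
```

Each output is `(phi(x), 0.0)` up to floating-point error, and every layer only adds a function of the other half, so no scaling is involved.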
Finally, if we are required to implement the translation maps using neural networks, we can use standard approximation-theoretic results for neural networks, combined with standard results from optimal transport, to show universality of affine coupling networks in Wasserstein. First, we recall the formal statement of Brenier's Theorem:

Theorem 9 (Brenier's Theorem, Theorem 2.12 of Villani (2003)). Suppose that $P$ and $Q$ are probability measures on $\mathbb{R}^n$ with densities with respect to the Lebesgue measure. Then $Q = (\nabla\psi)_\# P$ for $\psi$ a convex function, and moreover $\nabla\psi$ is the unique $W_2$-optimal transport map from $P$ to $Q$.

It turns out that the transportation map $\varphi := \nabla\psi$ is not always a continuous function; however, there are simple sufficient conditions for the distribution $Q$ under which the map is continuous (see e.g. Caffarelli (1992)). From these results (or by directly smoothing the transport map), we know any distribution with bounded support can be approached in Wasserstein distance by smooth pushforwards of Gaussians. So for simplicity, we state the following theorem for distributions which are the pushforward of smooth maps.

Theorem 10 (Universal approximation with zero-padding). Suppose that $P$ is the standard Gaussian measure in $\mathbb{R}^n$ and $Q = \varphi_\# P$ is the pushforward of the Gaussian measure through $\varphi$, where $\varphi$ is a smooth map. Then for any $\epsilon > 0$ there exists a depth 3 affine coupling network $g$ with no scaling and feedforward ReLU net translation maps such that $W_2(g_\#(P \times \delta_{0^n}), Q \times \delta_{0^n}) \le \epsilon$.

Proof. For any $M > 0$, let $f_M(x) = \min(M, \max(-M, x))$ be the 1-dimensional truncation map to $[-M, M]$, and for a vector $x \in \mathbb{R}^n$ let $f_M(x) \in [-M, M]^n$ be the result of applying $f_M$ coordinate-wise. Note that $f_M$ can be implemented as a ReLU network with two hidden units per input dimension. Also, any continuous function on $[-M, M]^n$ can be approximated arbitrarily well in $L^\infty$ by a sufficiently large ReLU neural network with one hidden layer (Leshno et al. (1993)).
Finally, note that if $\|f - g\|_{L^\infty} \le \epsilon$ then for any distribution $P$ we have $W_2(f_\# P, g_\# P) \le \epsilon$ by considering the natural coupling that feeds the same input into $f$ and $g$. Now we show how to approximate the construction of equation 15 using these tools. For any $\epsilon > 0$, if we choose $M$ sufficiently large and then take $\hat{\varphi}$ and $\widehat{\varphi^{-1}}$ to be sufficiently good approximations of $\varphi$ and $\varphi^{-1}$ on $[-M, M]^n$, we can construct an affine coupling network with ReLU feedforward network translation maps $\hat{t}_1(x) = f_M(\hat{\varphi}(f_M(x)))$, $\hat{t}_2(x) = x - \widehat{\varphi^{-1}}(x)$, and $\hat{t}_3(x) = -x$, such that the output has $W_2$ distance at most $\epsilon$ from $Q$.

Universality without padding. We now show that universality in Wasserstein can be proved even if we don't have zero-padding, using a lattice-based encoding and decoding scheme. Let $\epsilon > 0$ be a small constant, to be taken sufficiently small. Let $\epsilon' \in (0, \epsilon)$ be a further constant, taken sufficiently small with respect to $\epsilon$, and similarly $\epsilon''$ with respect to $\epsilon'$. Suppose the input dimension is $2n$, and let $X = (X_1, X_2)$ with independent $X_1 \sim N(0, I_{n \times n})$ and $X_2 \sim N(0, I_{n \times n})$ be the input to the affine coupling network. Let $f(x)$ be the map which rounds $x \in \mathbb{R}^n$ to the closest grid point in $\epsilon\mathbb{Z}^n$ and define $g(x) = x - f(x)$. Note that for a point of the form $z = f(x) + y$ for $y$ which is not too large, we have that $f(z) = f(x)$ and $g(z) = y$. Let $\varphi_1, \varphi_2$ be the desired transportation maps guaranteed by Brenier's theorem, so that the distribution of $\varphi_1(X)$ is the target distribution $Q$ and $\varphi_2(X)$ is a standard Gaussian independent of $\varphi_1(X)$. (In other words, $\varphi_1, \varphi_2$ correspond to the first half and second half of the output coordinates of the transport map from the $2n$-dimensional standard Gaussian to the desired padded distribution.) Now we consider the following sequence of maps:
$$(X_1, X_2) \mapsto (X_1,\ \epsilon' X_2 + f(X_1)) \quad (16)$$
$$\mapsto \big(f(\varphi_1(f(X_1), X_2)) + \epsilon' \varphi_2(f(X_1), X_2) + O(\epsilon''),\ \epsilon' X_2 + f(X_1)\big) \quad (17)$$
$$\mapsto \big(f(\varphi_1(f(X_1), X_2)) + \epsilon' \varphi_2(f(X_1), X_2) + O(\epsilon''),\ \varphi_2(f(X_1), X_2) + O(\epsilon''/\epsilon')\big). \quad (18)$$
More explicitly, in the first step we take $s_1(x) = \log(\epsilon')\vec{1}$ and $t_1(x) = f(x)$. In the second step, we take $s_2(x) = \log(\epsilon'')\vec{1}$ and $t_2$ is defined by $t_2(x) = f(\varphi_1(f(x), g(x)/\epsilon')) + \epsilon' \varphi_2(f(x), g(x)/\epsilon')$. In the third step, we take $s_3(x) = \log(\epsilon'')\vec{1}$ and define $t_3(x) = g(x)/\epsilon'$. Again, taking sufficiently good approximations to all of the maps allows us to approximate this map with neural networks, which we formalize below.

Proof of Theorem 4. Turning equation 16, equation 17, and equation 18 into a universal approximation theorem for ReLU-net based feedforward networks just requires modifying the proof of Theorem 10 for this scenario. Fix $\delta > 0$. The above argument shows we can choose $\epsilon, \epsilon', \epsilon'' > 0$ sufficiently small so that if $h$ is the map defined by composing equation 16, equation 17, and equation 18, then $W_2(h_\# P, Q) \le \delta/4$. The layers defining $h$ may not be continuous, since $f$ is only continuous almost everywhere. Using that continuous functions are dense in $L^2$, we can find a function $\tilde{f}$ which is continuous and such that if we define $\tilde{h}$ by replacing each application of $f$ by $\tilde{f}$, then $W_2(\tilde{h}_\# P, Q) \le \delta/2$. Finally, since $\tilde{h}$ is an affine coupling network with continuous $s$ and $t$ functions, we can use the same truncation-and-approximation argument from Theorem 10 to approximate it by an affine coupling network $g$ with ReLU feedforward $s$ and $t$ functions such that $W_2(g_\# P, Q) \le \delta$, which proves the result.

E APPROXIMATING ENTRYWISE NONLINEARITY WITH AFFINE COUPLINGS

To show how surprisingly hard it may be to represent even simple functions using affine couplings, we show an example of a very simple function, an entrywise application of hyperbolic tangent, s.t. an arbitrary depth/width sequence of affine coupling blocks with tanh nonlinearities cannot exactly represent it. Thus, even for simple functions, the affine-coupling structure imposes nontrivial restrictions. Note that in contrast to Theorem 4, we are considering exact representation here. Precisely, we show Theorem 11.

The proof of the theorem is fairly unusual, as it uses some tools from complex analysis in several variables (see Grauert & Fritzsche (2012) for a reference), though it is so short that we include it here. The result also generalizes to other neural networks with analytic activations.

Proof of Theorem 11. By compactness of the class of models bounded by $W, D, N, R$, it suffices to prove that there is no way to exactly represent the function. Suppose for contradiction that $f = g$ on the entirety of $[-1, 1]^d$. Let $z_1, \dots, z_d$ denote the $d$ inputs to the function: we now consider the behavior of $f$ and $g$ when we extend their definition to $\mathbb{C}^d$. From the definition, $g$ extends to a holomorphic function (of several variables) on all of $\mathbb{C}^d \setminus \{z : \exists j,\ z_j = i\pi(k + 1/2),\ k \in \mathbb{Z}\}$, i.e. everywhere where tanh doesn't have a pole. Similarly, there exists a dense open subset $D \subset \mathbb{C}^d$ on which the affine coupling network $f$ is holomorphic, because it is formed by the addition, multiplication, and composition of holomorphic functions. We next prove that $f = g$ on their complex extensions by the Identity Theorem (Theorem 4.1 of Grauert & Fritzsche (2012)). We must first show that $f = g$ on an open subset of $\mathbb{C}^d$. To prove this, observe that $f$ is analytic at zero and its power series expansion is uniquely defined in terms of the values of $f$ on $\mathbb{R}^d$ (for example, we can compute the coefficients by taking partial derivatives).
It follows that the power series expansions of $f$ and $g$ are equal at zero and convergent in an open neighborhood of $0$ in $\mathbb{C}^d$, so we can indeed apply the Identity Theorem; this shows that $f = g$ wherever they are both defined. From the definition $\tanh(z) = \frac{e^{2z} - 1}{e^{2z} + 1}$ we can see that $g$ is periodic in the sense that $g(z + \pi i k) = g(z)$ for any $k \in \mathbb{Z}^d$. However, by construction the affine coupling network $f$ is invertible whenever, at every layer, the output of the function $a$ is not equal to zero. By the Identity Theorem, the set of inputs where each $a$ vanishes is nowhere dense: otherwise, by continuity $a$ vanishes on the open neighborhood of some point, so $a = 0$ by the Identity Theorem, which contradicts the assumption. Therefore the union of inputs where $a$ at any layer vanishes is also nowhere dense. Consider the behavior of $f$ on an open neighborhood of $0$ and of $i\pi$: we have shown that $f$ is invertible except on a nowhere dense set, and also that $g = f$ wherever $f$ is defined, but $g(z) = g(z + i\pi)$, so it is impossible for $f$ to be invertible on these neighborhoods except on a nowhere dense subset. By contradiction, $f \ne g$ on $[-1, 1]^d$.

Finally, to give empirical evidence that the above is not merely a theoretical artifact, we regress an affine coupling architecture to fit entrywise tanh. Specifically, we sample 10-dimensional vectors from a standard Gaussian distribution and train networks as in the padding section on a squared error objective such that each input is regressed on its elementwise tanh. We train an affine coupling network with 5 pairs of alternating couplings with $g$ and $h$ networks consisting of 2 hidden layers with 128 units each. For comparison, we also regress a simple MLP with 2 hidden layers with 128 units in each layer, i.e. exactly one of the $g$ or $h$ subnetworks from the coupling architecture, which contains 20 such subnetworks.
For another comparison, we also try this on the elementwise ReLU function, using affine couplings with tanh activations and the same small MLP. As we see in Figure 3, the affine couplings fit the function substantially worse than a much smaller MLP, corroborating our theoretical result.
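The periodicity step in the proof of Theorem 11 can be checked numerically. The following is a minimal sketch using only the Python standard library; the sample point `z` and the helper name `g` are chosen for illustration and are not taken from the experiments. It evaluates entrywise tanh at a complex point and at the same point shifted by $i\pi$ in every coordinate, which is the collision that rules out invertibility of any function agreeing with $g$ on both neighborhoods.

```python
import cmath
import math

def g(z):
    """Entrywise tanh applied to a list of complex numbers."""
    return [cmath.tanh(zj) for zj in z]

# Illustrative point in C^2 (an assumption, not a value from the paper).
z = [0.3 + 0.1j, -0.7 + 0.2j]

# Shift every coordinate by i*pi, i.e. z + i*pi*k with k = (1, 1).
shifted = [zj + 1j * math.pi for zj in z]

# Largest entrywise difference between g(z) and g(z + i*pi*k).
gap = max(abs(u - v) for u, v in zip(g(z), g(shifted)))
print("max |g(z) - g(z + i*pi*k)| =", gap)
```

The printed gap is at the level of floating-point rounding, although the two inputs are distinct points of $\mathbb{C}^2$.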



Note for architectures having trainable biases in the input layer, these two notions of Lipschitzness should be expected to behave similarly.

In this Theorem and throughout, we use the standard asymptotic notation $f(d) = o(g(d))$ to indicate that $\limsup_{d \to \infty} \frac{f(d)}{g(d)} = 0$. For example, the assumption $k, L, R = \exp(o(d))$ means that for any sequence $(k_d, L_d, R_d)_{d=1}^{\infty}$ such that $\limsup_{d \to \infty} \frac{\max(\log k_d, \log L_d, \log R_d)}{d} = 0$ the result holds true.

Note, our theorem applies to exponentially large Lipschitz constants.

Alternatively, we could feed a degenerate Gaussian supported on a d-dimensional subspace into the network as input, but there is no way to train such a model using maximum-likelihood training, since the prior is degenerate.

The size of g doesn't indeed depend on . The weights in the networks will simply grow as becomes small.

The size of g doesn't indeed depend on . The weights in the networks will simply grow with .

This proof, given by HH Rugh, and some other ways to prove this result can be found at https://math.stackexchange.com/questions/1871783/every-permutation-is-a-product-of-two-permutations-of-order-2 .

As a side remark, this ground truth is only specified up to orthogonal matrices $U$, as $AUz$ is identically distributed to $Az$, due to the rotational invariance of the standard Gaussian.

Figure 2: Fitting a 4-component mixture of Gaussians using a RealNVP model with no padding, zero padding and Gaussian padding.

UNIVERSAL APPROXIMATION WITH ILL-CONDITIONED AFFINE COUPLING NETWORKS

D.1 SIMPLER UNIVERSALITY UNDER ZERO-PADDING

Theorem 11. Let $d \ge 2$ and denote $g : \mathbb{R}^d \to \mathbb{R}^d$, $g(z) := (\tanh z_1, \dots, \tanh z_d)$. Then, for any $W, D, N \in \mathbb{N}$ and norm $\|\cdot\|$, there exists an $\epsilon(W, D, N, R) > 0$ such that for any network $f$ consisting of a sequence of at most $N$ affine coupling layers of the form $(y_S, y_{\bar S}) \mapsto (y_S, y_{\bar S} \odot a(y_S) + b(y_S))$, where in each layer $S \subset [d]$ is an arbitrary set and $a, b$ are arbitrary feed-forward tanh neural networks of width at most $W$, depth at most $D$, and weight norm into each unit of at most $R$, it holds that $\mathbb{E}_{x \in [-1,1]^d} \|f(x) - g(x)\| > \epsilon(W, D, N, R)$.
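To make the layer form in the statement above concrete, here is a minimal sketch of a single affine coupling layer and its inverse in plain Python. The functions `a` and `b` are hypothetical toy choices (one tanh unit per output, with `a` shifted so that it never vanishes), not the networks used in the paper, and the helper names are ours. The inverse exists exactly when every output of `a` is nonzero, which is the invertibility condition used in the proof of Theorem 11.

```python
import math

def split_indices(d, S):
    """Return (S, complement of S in {0, ..., d-1}) as sorted index lists."""
    S = sorted(S)
    rest = [i for i in range(d) if i not in S]
    return S, rest

def coupling_forward(y, S, a, b):
    """Map (y_S, y_rest) to (y_S, y_rest * a(y_S) + b(y_S)), entrywise."""
    S, rest = split_indices(len(y), S)
    y_S = [y[i] for i in S]
    scale, shift = a(y_S), b(y_S)
    out = list(y)
    for j, i in enumerate(rest):
        out[i] = y[i] * scale[j] + shift[j]
    return out

def coupling_inverse(x, S, a, b):
    """Invert the layer; possible only where no entry of a(x_S) is zero."""
    S, rest = split_indices(len(x), S)
    x_S = [x[i] for i in S]  # these coordinates pass through unchanged
    scale, shift = a(x_S), b(x_S)
    if any(s == 0.0 for s in scale):
        raise ValueError("a vanishes here, so the layer is not invertible")
    out = list(x)
    for j, i in enumerate(rest):
        out[i] = (x[i] - shift[j]) / scale[j]
    return out

# Hypothetical toy scale/shift functions (assumptions for illustration only).
def a(y_S):
    return [1.5 + math.tanh(y_S[0]), 1.5 + math.tanh(y_S[1])]  # always > 0.5

def b(y_S):
    return [math.tanh(y_S[0] - y_S[1]), math.tanh(y_S[0] + y_S[1])]

y = [0.2, -0.4, 1.0, -2.0]
S = [0, 1]
x = coupling_forward(y, S, a, b)
y_back = coupling_inverse(x, S, a, b)
err = max(abs(u - v) for u, v in zip(y, y_back))
print("output:", x)
print("round-trip error:", err)
```

A network as in the theorem composes at most $N$ such layers, each with its own choice of $S$, $a$, and $b$.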

Figure 3: The smaller MLPs are much better able to fit simple elementwise nonlinearities than the affine couplings.

Figure 4: Learning Partitioned Linear Networks on 4-D linear functions.

Figure 5: Learning Partitioned Linear Networks on 8-D linear functions.

Figure 6: Learning Partitioned Linear Networks on 16-D linear functions.

Figure 7: Learning Partitioned Linear Networks on 32-D linear functions.

Figure 8: Learning Partitioned Linear Networks on 64-D linear functions.

Figure 9: Learning Partitioned Linear Networks on 4-D Toeplitz functions.

Figure 10: Learning Partitioned Linear Networks on 8-D Toeplitz functions.

Figure 11: Learning Partitioned Linear Networks on 16-D Toeplitz functions.

Figure 12: Learning Partitioned Linear Networks on 32-D Toeplitz functions.

Figure 13: Learning Partitioned Linear Networks on 64-D Toeplitz functions.

Figure 14: Real NVP Regressed on 4-D Linear Functions

Figure 18: Real NVP Regressed on 64-D Linear Functions

So if we choose δ = 2M

√ dk , we have the desired bound in $W_1$. (We note that making δ small only manifests in the size of the weights of the functions 1, and not in the size of the network itself. This is obvious from the construction in Lemma 3 in Arora et al. (2017).) Proceeding to subgaussianity, consider a 1-Lipschitz function $\varphi$ centered such that $\mathbb{E}[(\varphi \circ f)_{\# N}] = 0$. Next, we'll show that $(\varphi \circ f)_{\# N}$ is subgaussian with an appropriate constant. We can view $f_{\# N}$ as the sum of two random variables: $\gamma z$ and another term. Here $\gamma z$ is a Gaussian with covariance $\gamma^2 I$, and the other term is contained in an $\ell_2$ ball of radius $M$. Using the Lipschitz property and Lipschitz concentration for Gaussians (Theorem 5.2.2 of Vershynin (2018)), we see that. By considering separately the cases $|t| \le 2M$ and $|t| > 2M$, we immediately see this implies that the pushforward is $O(\gamma^2 + M^2)$-subgaussian. Since $M^2 = O(\gamma^2 d)$, the claim follows.

A.5 KL DIVERGENCE BOUNDS

In this section, we use the Bobkov-Goetze inequality to derive the KL divergence bounds from the Wasserstein bounds.

Concretely:

Theorem 8 (Bobkov & Götze (1999)). Let $p, q : \mathbb{R}^d \to \mathbb{R}_+$ be two distributions s.t. for every 1-Lipschitz $f : \mathbb{R}^d \to \mathbb{R}_+$ and $X \sim p$, $f(X)$ is $c^2$-subgaussian. Then, we have KL(q, p) ≥

Then, to finish the two inequalities in the statement of the main theorem, we will show that:

• For any mixture of $k$ Gaussians where the component means $\mu_i$ satisfy $\|\mu_i\| \le M$, the condition of Theorem 8 is satisfied with $c^2 = O(\gamma^2 + M^2)$. (In fact, we show this for the pushforward through $g$, the neural network which approximates the mixture, which poses some non-trivial technical challenges. See Appendix A, Lemma 9.)

• A pushforward of the standard Gaussian through an $L$-Lipschitz generator $f$ satisfies the conditions of Theorem 8 with $c^2 = L^2$, which implies the second part of the claim. (Theorem 5.2.2 in Vershynin (2018).)

B MISSING PROOFS FOR SECTION 5

B.1 UPPER BOUND

First, we recall a folklore result about permutations. Let $S_n$ denote the symmetric group on $n$ elements, i.e. the set of permutations of $\{1, \dots, n\}$ equipped with the multiplication operation of composition. Recall that the order of a permutation $\pi$ is the smallest positive integer $k$ such that $\pi^k$ is the identity permutation.

Lemma 10. For any permutation $\pi \in S_n$, there exist $\sigma_1, \sigma_2 \in S_n$ of order at most 2 such that $\pi = \sigma_1 \sigma_2$.

Proof. This result is folklore. We include a proof of it for completeness⁷.

First, recall that every permutation $\pi$ has a unique decomposition $\pi = c_1 \cdots c_k$ as a product of disjoint cycles. Therefore if we show the result for a single cycle, so $c_i = \sigma_{i1} \sigma_{i2}$ for every $i$, then

