REPRESENTATIONAL ASPECTS OF DEPTH AND CONDITIONING IN NORMALIZING FLOWS

Abstract

Normalizing flows are among the most popular paradigms in generative modeling, especially for images, primarily because we can efficiently evaluate the likelihood of a data point. This is desirable both for evaluating the fit of a model and for ease of training, as maximizing the likelihood can be done by gradient descent. However, training normalizing flows comes with difficulties as well: models which produce good samples typically need to be extremely deep, which comes with accompanying vanishing/exploding gradient problems. A closely related problem is that they are often poorly conditioned: since they are parametrized as invertible maps from R^d → R^d, and typical training data like images is intuitively lower-dimensional, the learned maps often have Jacobians that are close to singular. In this paper, we tackle representational aspects of depth and conditioning in normalizing flows, both for general invertible architectures and for a particularly common architecture, affine couplings. For general invertible architectures, we prove that invertibility comes at a cost in terms of depth: we show examples where a much deeper normalizing flow model may be needed to match the performance of a non-invertible generator. For affine couplings, we first show that the choice of partition is not a likely bottleneck for depth: any invertible linear map (and hence any permutation) can be simulated by a constant number of affine coupling layers using a fixed partition. This shows that the extra flexibility conferred by 1x1 convolution layers, as in GLOW, can in principle be simulated by increasing the size by a constant factor. Next, in terms of conditioning, we show that affine couplings are universal approximators, provided the Jacobian of the model is allowed to be close to singular. We furthermore empirically explore the benefit of different kinds of padding, a common strategy for improving conditioning.

1. INTRODUCTION

Deep generative models are one of the lynchpins of unsupervised learning, underlying tasks spanning distribution learning, feature extraction, and transfer learning. Parametric families of neural-network-based models have been improved to the point of being able to model complex distributions such as images of human faces. One paradigm that has received a lot of attention is normalizing flows, which model distributions as pushforwards of a standard Gaussian (or other simple distribution) through an invertible neural network G. The likelihood thus has an explicit form via the change-of-variables formula, using the Jacobian of G.

Training normalizing flows is challenging due to two main issues. Empirically, these models seem to require a much larger size than other generative models (e.g. GANs) and, most notably, a much larger depth, which makes training difficult due to vanishing/exploding gradients. A closely related problem is conditioning, more precisely the smallest singular value of the forward map G. It is intuitively clear that natural images have low-dimensional structure, so a close-to-singular G might be needed to fit them; on the other hand, the change-of-variables formula involves the determinant of the Jacobian of G^{-1}, which grows larger the more singular G is.

While the universal approximation power of various types of invertible architectures has recently been studied (Dupont et al., 2019; Huang et al., 2020), provided the input is padded with a sufficiently large number of all-0 coordinates, a precise quantification of the cost of invertibility, in terms of the required depth and the conditioning of the model, has not been fleshed out. In this paper, we study, both mathematically and empirically, representational aspects of depth and conditioning in normalizing flows, and answer several fundamental questions.
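As a concrete illustration of the change-of-variables likelihood discussed above, and of how conditioning enters it, the following minimal numpy sketch (our own illustration, not code from the paper; the affine generator G(z) = Az + b is a hypothetical toy stand-in for a flow) evaluates the exact log-likelihood and shows what happens when the Jacobian approaches singularity.

```python
import numpy as np

def gaussian_logpdf(z):
    """Log-density of a standard Gaussian N(0, I_d)."""
    d = z.shape[-1]
    return -0.5 * (d * np.log(2 * np.pi) + np.sum(z ** 2, axis=-1))

def flow_loglik(x, A, b):
    """Change of variables: log p(x) = log p_Z(G^{-1}(x)) - log|det J_G|,
    where for the affine generator G(z) = A z + b the Jacobian is A everywhere."""
    z = np.linalg.solve(A, x - b)        # G^{-1}(x)
    _, logabsdet = np.linalg.slogdet(A)  # log |det A|
    return gaussian_logpdf(z) - logabsdet

d = 4
b = np.zeros(d)
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
x[3] = 0.0  # place x on the low-dimensional subspace G will concentrate on

# Shrinking the smallest singular value of A makes G closer to singular;
# the -log|det A| term then drives the density of on-subspace points up.
for eps in [1.0, 1e-2, 1e-4]:
    A = np.diag([1.0, 1.0, 1.0, eps])
    print(f"sigma_min={eps:g}  log p(x) = {flow_loglik(x, A, b):.2f}")
```

As eps shrinks, the model assigns increasingly extreme density to points on the subspace, mirroring the tension described above: fitting low-dimensional data pushes G toward singularity, while the determinant term in the likelihood blows up.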

2.1. RESULTS ABOUT GENERAL ARCHITECTURES

In order to guarantee that the network is invertible, normalizing flow models place significant restrictions on the architecture of the model. The most basic question we can ask is how this restriction affects the expressive power of the model, in particular, how much the depth must increase to compensate. More precisely, we ask:

Question 1: Is there a distribution over R^d which can be written as the pushforward of a Gaussian through a small, shallow generator, but which cannot be approximated by the pushforward of a Gaussian through a small, shallow layerwise invertible neural network?

Given that there is great latitude in the choice of layer architecture while keeping the network invertible, the most general way to pose this question is to require each layer to be a function of p parameters: i.e. f = f_1 ∘ f_2 ∘ ⋯ ∘ f_ℓ, where ∘ denotes function composition and each f_i : R^d → R^d is an invertible function specified by a vector θ_i ∈ R^p of parameters. This framing is extremely general: for instance, it includes layerwise invertible feedforward networks, in which f_i(x) = σ^{⊗d}(A_i x + b_i) with σ invertible and A_i ∈ R^{d×d} invertible, so that θ_i = (A_i, b_i) and p = d(d+1). It also includes popular architectures based on affine coupling blocks (e.g. Dinh et al. (2014; 2016); Kingma & Dhariwal (2018)), where each f_i has the form f_i(x_S, x_{[d]∖S}) = (x_S, x_{[d]∖S} ⊙ g_i(x_S) + h_i(x_S)) for some S ⊂ [d]; we revisit these in more detail in the following subsection.

We answer this question in the affirmative: namely, we show for any k that there is a distribution over R^d which can be expressed as the pushforward of a Gaussian through a network with depth O(1) and size O(k), but that cannot be (even very approximately) expressed as the pushforward of a Gaussian through a Lipschitz layerwise invertible network of depth smaller than k/p.

Towards formally stating the result, let θ = (θ_1, …, θ_ℓ) ∈ Θ ⊂ R^{ℓp} be the vector of all parameters (e.g. weights, biases) in the network, where θ_i ∈ R^p are the parameters that correspond to layer i, and let f_θ : R^d → R^d denote the resulting function. Define R so that Θ is contained in the Euclidean ball of radius R. We say the family {f_θ} is L-Lipschitz with respect to its parameters and inputs if

∀θ, θ′ ∈ Θ : E_{x ∼ N(0, I_{d×d})} ‖f_θ(x) − f_{θ′}(x)‖ ≤ L‖θ − θ′‖, and
∀x, y ∈ R^d : ‖f_θ(x) − f_θ(y)‖ ≤ L‖x − y‖.

(Note that for architectures having trainable biases in the input layer, these two notions of Lipschitzness should be expected to behave similarly.) We will discuss the reasonable range for L in terms of the weights after the theorem statement. We show:

Theorem 1. For any k = exp(o(d)), L = exp(o(d)), R = exp(o(d)), we have that for d sufficiently large and any γ > 0, there exists a neural network g : R^{d+1} → R^d with O(k) parameters and depth O(1), such that for any family {f_θ : θ ∈ Θ} of layerwise invertible networks that is L-Lipschitz with respect to its parameters and inputs, has p parameters per layer, and has depth at most k/p, we have

∀θ ∈ Θ : W_1((f_θ)_{#N}, g_{#N}) ≥ 10γ²d.

Furthermore, for all θ ∈ Θ, KL((f_θ)_{#N}, g_{#N}) ≥ 1/10 and KL(g_{#N}, (f_θ)_{#N}) ≥ 10γ²d/L².

In this theorem and throughout, we use the standard asymptotic notation f(d) = o(g(d)) to indicate that lim sup_{d→∞} f(d)/g(d) = 0. For example, the assumption k, L, R = exp(o(d)) means that the result holds for any sequence (k_d, L_d, R_d)_{d=1}^∞ such that lim sup_{d→∞} max(log k_d, log L_d, log R_d)/d = 0.

Remark 1: First, note that while the number of parameters in both networks is comparable (i.e. it is O(k)), the invertible network is deeper, which is usually accompanied by algorithmic difficulties in training due to vanishing and exploding gradients. For layerwise invertible generators, if we assume that the nonlinearity σ is 1-Lipschitz and each matrix in the network has operator norm at most M, then the Lipschitz constant of f_θ with respect to its input is at most M^{k/p}, so the assumption L = exp(o(d)) is satisfied whenever (k/p) log M = o(d).
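The affine coupling block f_i(x_S, x_{[d]∖S}) = (x_S, x_{[d]∖S} ⊙ g_i(x_S) + h_i(x_S)) mentioned above can be sketched in a few lines of numpy. The conditioner networks g and h below are hypothetical placeholders (single linear maps, with an exp keeping the scale strictly positive), not anything from the paper; the point is only that the layer is invertible in closed form for any g and h as long as g never vanishes.

```python
import numpy as np

def coupling_forward(x, S, g, h):
    """Affine coupling: pass x_S through unchanged, transform the rest
    elementwise: y_S = x_S,  y_rest = x_rest * g(x_S) + h(x_S)."""
    y = x.copy()
    rest = ~S
    y[rest] = x[rest] * g(x[S]) + h(x[S])
    return y

def coupling_inverse(y, S, g, h):
    """The inverse is explicit: x_rest = (y_rest - h(y_S)) / g(y_S)."""
    x = y.copy()
    rest = ~S
    x[rest] = (y[rest] - h(y[S])) / g(y[S])
    return x

rng = np.random.default_rng(1)
d = 6
S = np.zeros(d, dtype=bool)
S[: d // 2] = True  # fixed partition: first half is passed through

# Placeholder conditioners; exp makes the scale g strictly positive,
# which guarantees the layer is invertible.
Wg, Wh = rng.standard_normal((2, d - d // 2, d // 2))
g = lambda xs: np.exp(Wg @ xs)
h = lambda xs: Wh @ xs

x = rng.standard_normal(d)
y = coupling_forward(x, S, g, h)
x_rec = coupling_inverse(y, S, g, h)
print(np.max(np.abs(x - x_rec)))  # reconstruction error near machine precision
```

Because only x_rest is modified, the Jacobian of this map is triangular and log|det J| is simply the sum of log g(x_S); this is what makes exact likelihood evaluation cheap for affine couplings, and it also makes the conditioning of the layer directly visible in the learned scales g.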

