LEARNING AND GENERALIZATION IN UNIVARIATE OVERPARAMETERIZED NORMALIZING FLOWS

Abstract

In supervised learning, it is known that overparameterized neural networks with one hidden layer provably and efficiently learn and generalize when trained using Stochastic Gradient Descent (SGD). In contrast, the benefit of overparameterization in unsupervised learning is not well understood. Normalizing flows (NFs) learn to map complex real-world distributions into simple base distributions and constitute an important class of models in unsupervised learning for sampling and density estimation. In this paper, we theoretically and empirically analyze these models when the underlying neural network is a one-hidden-layer overparameterized network. On the one hand, we provide evidence that for a class of NFs, overparameterization hurts training. On the other hand, we prove that another class of NFs, with similar underlying networks, can efficiently learn any reasonable data distribution under minimal assumptions. We extend theoretical ideas on learning and generalization from overparameterized neural networks in supervised learning to overparameterized normalizing flows in unsupervised learning. We also provide experimental validation to support our theoretical analysis in practice.

1. INTRODUCTION

Neural network models trained using simple first-order iterative algorithms have been very effective in both supervised and unsupervised learning. A theoretical explanation of this phenomenon requires one to consider simple but quintessential formulations where it can be demonstrated by mathematical proof, along with experimental evidence for the underlying intuition. First, the minimization of training loss is typically a non-smooth and non-convex optimization over the parameters of neural networks, so it is surprising that neural networks can be trained efficiently by first-order iterative algorithms. Second, even large neural networks, whose number of parameters exceeds the size of the training data, often generalize well, with a small loss on unseen test data, instead of overfitting the seen training data. Recent work in supervised learning attempts to provide theoretical justification for why overparameterized neural networks can train and generalize efficiently in the above sense. In supervised learning, empirical risk minimization with the quadratic loss is a non-convex optimization problem even for a fully connected neural network with one hidden layer of neurons with ReLU activations. Around 2018, it was realized that when the hidden layer size is large compared to the dataset size, or compared to some measure of complexity of the data, one can provably show efficient training and generalization for these networks, e.g., Jacot et al. (2018); Li & Liang (2018); Du et al. (2018); Allen-Zhu et al. (2019); Arora et al. (2019). Of these, Allen-Zhu et al. (2019) is directly relevant to our paper and will be discussed later. The role of overparameterization, and provable training and generalization guarantees for neural networks, are less well understood in unsupervised learning. Generative modeling, i.e., learning a data distribution from given samples, is an important problem in unsupervised learning.
Popular generative models based on neural networks include Generative Adversarial Networks (GANs) (e.g., Goodfellow et al. (2014)), Variational AutoEncoders (VAEs) (e.g., Kingma & Welling (2014)), and Normalizing Flows (e.g., Rezende & Mohamed (2015)). GANs and VAEs have shown an impressive capability to generate samples of photo-realistic images, but they cannot give probability density estimates for new data points. Training of GANs and VAEs has various additional challenges such as mode collapse, posterior collapse, vanishing gradients, and training instability, as shown in, e.g., Bowman et al. (2016); Salimans et al. (2016); Arora et al. (2018); Lucic et al. (2018). In contrast to generative models such as GANs and VAEs, when normalizing flows learn distributions, they can do both sampling and density estimation, leading to wide-ranging applications as described in the surveys by Kobyzev et al. (2020) and Papamakarios et al. (2019). Theoretical understanding of learning and generalization in normalizing flows (more generally, generative models and unsupervised learning) is a natural and important open question, and our main technical contribution is to extend known techniques from supervised learning to make progress towards answering this question. In this paper, we study learning and generalization in the case of univariate overparameterized normalizing flows. The restriction to the univariate case is technically non-trivial and interesting in its own right: univariate ReLU networks have been studied in the recent supervised learning literature (e.g., Savarese et al. (2019), Williams et al. (2019), Sahs et al. (2020) and Daubechies et al. (2019)). Multidimensional flows are qualitatively more complex, and our 1D analysis sheds some light on them (see Sec. 4). Before stating our contributions, we briefly introduce normalizing flows; details appear in Section 2. Normalizing Flows. We work with one-dimensional probability distributions with continuous density.
The general idea behind normalizing flows (NFs), restricted to 1D, can be summarized as follows. Let $X \in \mathbb{R}$ be a random variable denoting the data distribution. We also fix a base distribution with associated random variable $Z$, which is typically standard Gaussian, though in this paper we will work with the exponential distribution as well. Given i.i.d. samples of $X$, the goal is to learn a continuous, strictly monotone increasing map $f_X : \mathbb{R} \to \mathbb{R}$ that transports the distribution of $X$ to the distribution of $Z$: in other words, the distribution of $f_X^{-1}(Z)$ is that of $X$. The learning of $f_X$ is done by representing it by a neural network and setting up an appropriate loss function. The monotonicity requirement on $f$, which makes $f$ invertible, while not essential, greatly simplifies the problem and is present in all the works we are aware of; it is not clear how to set up a tractable optimization problem without this requirement. Since the functions represented by standard neural networks are not necessarily monotone, the design of the neural net is altered to make it monotone. For our 1D situation, one-hidden-layer networks are of the form $N(x) = \sum_{i=1}^m a_i \sigma(w_i x + b_i)$, where $m$ is the size of the hidden layer and the $a_i, w_i, b_i$ are the parameters of the network. We will assume that the activation functions used are monotone. Here we distinguish between two such alterations: (1) Changing the parametrization of the neural network. This can be done in multiple ways: instead of $a_i, w_i$ we use $a_i^2, w_i^2$ (or other functions of $a_i, w_i$ that take on only positive values, such as the exponential function) (Huang et al., 2018; Cao et al., 2019). This approach appears to be the most popular. In this paper, we also suggest another related alteration: we simply restrict the parameters $a_i, w_i$ to be positive. This is achieved by enforcing the constraint during training.
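To make alteration (1) concrete, here is a minimal NumPy sketch (our own illustration; the network size and parameter values are arbitrary) of a one-hidden-layer network made monotone by squaring $a_i$ and $w_i$:

```python
import numpy as np

def monotone_net(x, a, w, b):
    """One-hidden-layer network made monotone by squaring a_i and w_i.

    f(x) = sum_i a_i^2 * sigma(w_i^2 * x + b_i), with sigma = ReLU.
    Since a_i^2 >= 0, w_i^2 >= 0, and ReLU is nondecreasing,
    f is nondecreasing in x for arbitrary raw parameter values.
    """
    pre = (w ** 2) * x + b          # slopes w_i^2 are always nonnegative
    hidden = np.maximum(pre, 0.0)   # monotone activation (ReLU)
    return np.sum((a ** 2) * hidden)

rng = np.random.default_rng(0)
a, w, b = rng.normal(size=3), rng.normal(size=3), rng.normal(size=3)
xs = np.linspace(-2.0, 2.0, 101)
ys = [monotone_net(x, a, w, b) for x in xs]
assert all(y2 >= y1 for y1, y2 in zip(ys, ys[1:]))  # monotone in x
```

Squaring removes the sign constraint from the raw parameters, so standard gradient-based training applies unchanged.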
(2) Instead of using $N(x)$ for $f(x)$, we use $\phi(N(x))$ for $f'(x) = \frac{df}{dx}$, where $\phi : \mathbb{R} \to \mathbb{R}^+$ takes on only positive values. Positivity of $f'$ implies monotonicity of $f$. Note that no restrictions on the parameters are required; however, because we parametrize $f'$, the function $f$ needs to be reconstructed using numerical quadrature. This approach is used by Wehenkel & Louppe (2019). We will refer to the models in the first class as constrained normalizing flows (CNFs) and those in the second class as unconstrained normalizing flows (UNFs). Our Contributions. In this paper, we study both constrained and unconstrained univariate NFs, theoretically as well as empirically. The existing analyses for overparametrized neural networks in the supervised setting work with a linear approximation of the neural network, termed the pseudo network in Allen-Zhu et al. (2019). They show that (1) there is a pseudo network with weights close to the initial ones approximating the target function, and (2) the loss surfaces of the neural network and the pseudo network are close; moreover, the latter is convex for convex loss functions. This allows a proof of convergence of the training of the neural network to a global optimum. One can try to adapt the approach of using a linear approximation of the neural network to analyze the training of NFs. However, one immediately encounters a new roadblock: the loss surface of the pseudo network is non-convex for both CNFs and UNFs. In both cases, we identify novel variations that make the optimization problem for the associated pseudo network convex. For CNFs, instead of using $a_i^2, w_i^2$ as parameters, we simply impose the constraints $a_i \ge \epsilon$ and $w_i \ge \epsilon$ for some small constant $\epsilon > 0$. The optimization algorithm now is projected SGD, which in this case incurs essentially no extra cost over SGD due to the simplicity of the positivity constraints.
Apart from making the optimization problem convex, in experiments this variation slightly improves the training of NFs compared to the reparametrization approaches, and may be useful in practical settings. Similarly, for UNFs we identify two changes from the model of Wehenkel & Louppe (2019) that make the associated optimization problem convex, while still retaining empirical effectiveness: (1) Instead of the Clenshaw–Curtis quadrature employed in Wehenkel & Louppe (2019), which uses positive and negative coefficients, we use the simple rectangle quadrature, which uses only positive coefficients. This change makes the model somewhat slower (it uses about twice as many samples, and twice the time, to get similar performance on the examples we tried). (2) Instead of the standard Gaussian distribution as the base distribution, we use the exponential distribution. In experiments, this does not cause much change. Our results point to a dichotomy between these two classes of NFs: our variant of UNFs can be theoretically analyzed when the networks are overparametrized, to prove that the UNF indeed learns the data distribution. To our knowledge, this is the first "end-to-end" analysis of an NF model, and of a neural generative model trained with the gradient-based algorithms used in practice. This proof, while following the high-level scheme of the proof of Allen-Zhu et al. (2019), has a number of differences, conceptual as well as technical, due to the different setting; e.g., our loss function involves a function and its integral estimated by quadrature. On the other hand, for CNFs, our empirical and theoretical findings provide evidence that overparametrization makes training slower, to the extent that models of similar size which learn the data distribution well as UNFs fail to do so as CNFs. We also analyze CNFs theoretically in the overparametrized setting and point to potential sources of the difficulty.
The case of moderate-sized networks, where training and generalization do take place empirically, is likely to be difficult to analyze theoretically, as this setting is currently open even for the simpler supervised learning case. We hope that our results will pave the way for further progress. We make some remarks on the multidimensional case in Sec. 4. In summary, our contributions include:
• To our knowledge, the first efficient training and generalization proof for NFs (in 1D).
• Identification of architectural variants of UNFs that admit analysis via overparametrization.
• Identification of "barriers" to the analysis of CNFs.
Related Work. Most variants of normalizing flows are specific to certain applications, and the expressive power (i.e., which base and data distributions they can map between) and complexity of normalizing flow models have been studied recently, e.g., Kong & Chaudhuri (2020) and Teshima et al. (2020). Invertible transformations defined by monotonic neural networks can be combined into autoregressive flows that are universal density approximators of continuous probability distributions; see Masked Autoregressive Flows (MAF) by Papamakarios et al. (2017), UMNN-MAF by Wehenkel & Louppe (2019), Neural Autoregressive Flows (NAF) by Huang et al. (2018), and Block Neural Autoregressive Flow (B-NAF) by Cao et al. (2019). The Unconstrained Monotonic Neural Network (UMNN) models proposed by Wehenkel & Louppe (2019) are particularly relevant to the technical part of our paper. Lei et al. (2020) show that when the generator is a two-layer tanh, sigmoid or leaky ReLU network, Wasserstein GAN trained with stochastic gradient descent-ascent converges to a global solution with polynomial time and sample complexity. Using the method of moments and a learning algorithm motivated by tensor decomposition, Li & Dou (2020) show that GANs can efficiently learn a large class of distributions, including those generated by two-layer networks. Nguyen et al. (2019b) show that two-layer autoencoders with ReLU or threshold activations can be trained with normalized gradient descent over the reconstruction loss to provably learn the parameters of any generative bilinear model (e.g., mixture of Gaussians, sparse coding model). Nguyen et al. (2019a) extend the work of Du et al. (2018) on supervised learning mentioned earlier to study weakly-trained (i.e., only the encoder is trained) and jointly-trained (i.e., both encoder and decoder are trained) two-layer autoencoders, and show that joint training requires less overparameterization and converges to a global optimum. The effect of overparameterization in unsupervised learning has also been of recent interest. Buhai et al. (2020) perform an empirical study showing that, across a variety of latent variable models and training algorithms, overparameterization can significantly increase the number of recovered ground-truth latent variables. Radhakrishnan et al. (2020) show that overparameterized autoencoders and sequence encoders essentially implement associative memory by storing training samples as attractors in a dynamical system.


Outline. A brief outline of our paper is as follows. Section 2 contains preliminaries and an overview of our results about constrained and unconstrained normalizing flows. Appendix B shows the existence of a pseudo network whose loss closely approximates the loss of the target function. Appendix C shows the coupling or closeness of their gradients over random initialization. Appendices D and E contain complete proofs of our optimization and generalization results, respectively. Section 3 and Appendix G contain our empirical studies towards validating our theoretical results.

2. PRELIMINARIES AND OVERVIEW OF RESULTS

We confine our discussion to the 1D case, which is the focus of the present paper. The goal of an NF is to learn a probability distribution given via i.i.d. samples. We work with distributions whose densities have bounded support, assumed to be $[-1, 1]$ without loss of generality. Let $X$ be the random variable corresponding to the data distribution we want to learn. We denote the probability density (we often just say density) of $X$ at $u \in \mathbb{R}$ by $p_X(u)$. Let $Z$ be a random variable with either the standard Gaussian or the exponential distribution with $\lambda = 1$ (which we call standard exponential). Recall that the density of the standard exponential distribution at $u \in \mathbb{R}$ is given by $e^{-u}$ for $u \ge 0$ and $0$ for $u < 0$. Let $f : \mathbb{R} \to \mathbb{R}$ be a strictly increasing continuous function. Thus, $f$ is invertible. We use $f'(x) = \frac{df}{dx}$ to denote the derivative. Let $p_{f,Z}(\cdot)$ be the density of the random variable $f^{-1}(Z)$. Let $x = f^{-1}(z)$ for $z \in \mathbb{R}$. Then the standard change of density formula, using the monotonicity of $f$, gives
$$p_{f,Z}(x) = p_Z(z) f'(x). \quad (2.1)$$
We would like to choose $f$ so that $p_{f,Z} = p_X$, the true data density. It is known that such an $f$ always exists and is unique; see e.g. Chapter 2 of Santambrogio (2015). We will refer to the distribution of $Z$ as the base distribution. Note that if we can find $f$, then we can generate samples of $X$ using $f^{-1}(Z)$, since generating the samples of $Z$ is easy. Similarly, we can evaluate $p_X(x) = p_Z(f(x)) f'(x)$ using (2.1). To find $f$ from the data, we set up the maximum log-likelihood objective
$$\max_f \frac{1}{n} \sum_{i=1}^n \log p_{f,Z}(x_i) = \max_f \frac{1}{n} \left( \sum_{i=1}^n \log p_Z(f(x_i)) + \sum_{i=1}^n \log f'(x_i) \right), \quad (2.2)$$
where $S = \{x_1, \ldots, x_n\} \subset \mathbb{R}$ contains i.i.d. samples of $X$, and the maximum is over continuous strictly increasing functions. When $Z$ is standard exponential, the optimization problem (2.2) becomes $\min_f L(f, S)$, where $L(f, S) = \frac{1}{n} \sum_{x \in S} L(f, x)$ and $L(f, x) = f(x) - \log f'(x)$.
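As a sanity check on the exponential-base loss, consider the simple example (our own, not from the paper) of uniform data: for $X \sim \mathrm{Uniform}[0,1]$, the transport map $F^*(x) = -\log(1-x)$ sends $X$ to a standard exponential, and $L(F^*, x) = -\log p_X(x) = 0$ for all $x$ in the support. A small NumPy sketch (function names are ours):

```python
import numpy as np

def transport(x):
    # For X ~ Uniform[0, 1], F*(x) = -log(1 - x) maps X to Exp(1):
    # Z = F*(X) is standard exponential.
    return -np.log(1.0 - x)

def transport_deriv(x):
    return 1.0 / (1.0 - x)

def nll_exp_base(f, f_prime, x):
    # L(f, x) = f(x) - log f'(x): negative log-likelihood of x under
    # the flow f with the standard exponential as base distribution.
    return f(x) - np.log(f_prime(x))

xs = np.linspace(0.05, 0.95, 19)
losses = nll_exp_base(transport, transport_deriv, xs)
# the density of Uniform[0, 1] is 1, so -log p_X(x) = 0 at the optimum
assert np.allclose(losses, 0.0)
```

The loss of the optimal transport map thus matches the negative log-density of the data, consistent with the maximum-likelihood derivation above.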
(2.3) A similar expression, with $f(x)^2/2$ replacing $f(x)$, holds for the standard Gaussian. We denote the loss for the standard Gaussian as $L_G(f, x)$. Informally, one would expect that as $n \to \infty$, for the optimum $f$ in the above optimization problems, $p_{f,Z} \to p_X$. To make the above optimization problem tractable, instead of $f$ we use a neural network $N$. We consider one-hidden-layer neural networks of the following basic form, which will then be modified according to whether we are constraining the parameters or the output:
$$N(x) = \sum_{r=1}^m a_{r0}\, \rho\left( (w_{r0} + w_r) x + (b_r + b_{r0}) \right). \quad (2.4)$$
Here $m$ is the size of the hidden layer, $\rho : \mathbb{R} \to \mathbb{R}$ is a monotonically increasing activation function, the weights $a_{r0}, w_{r0}, b_{r0}$ are the initial weights chosen at random according to some distribution, and $w_r, b_r$ are offsets from the initial weights. We will only train the $w_r, b_r$; the $a_{r0}$ remain frozen at their initial values. Let $\theta = (W, B) \in \mathbb{R}^{2m}$ denote the parameters, where $W = (w_1, \ldots, w_m)$ and $B = (b_1, \ldots, b_m)$. We denote by $\theta_t = (W_t, B_t)$ the parameters at time step $t = 1, 2, \ldots$, and the corresponding network by $N_t(x)$. The SGD updates are given by $\theta_{t+1} = \theta_t - \eta \nabla_\theta L_s(N_t, x_t)$, where $\eta > 0$ is the learning rate, $L_s(N_t, x_t)$ is a loss function, and $x_t \in S$ is chosen uniformly at random at each time step. For supervised learning, where we are given labeled data $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, one often works with the mean square loss $L_s(N_t) = \frac{1}{n} \sum_{i=1}^n L_s(N_t, x_i)$ with $L_s(N_t, x_i) = (N_t(x_i) - y_i)^2$. We now very briefly outline the proof technique of Allen-Zhu et al. (2019) for analyzing training and generalization of one-hidden-layer neural networks for supervised learning. (While they work in a general agnostic learning setting, for simplicity we restrict the discussion to the realizable setting.) In their setting, the data $x \in \mathbb{R}^d$ is generated by some distribution $\mathcal{D}$ and the labels $y = h(x)$ are generated by some unknown function $h : \mathbb{R}^d \to \mathbb{R}$.
The function $h$ is assumed to have small "complexity" $C_h$, which in this case measures the required size of a neural network with smooth activations to approximate $h$. The problem of optimizing the square loss is non-convex even for one-hidden-layer networks. Allen-Zhu et al. (2019) instead work with a pseudo network $P(x)$, which is the linear approximation of $N(x)$ given by the first-order Taylor expansion of the activation:
$$P(x) = \sum_{r=1}^m a_{r0} \left( \sigma(w_{r0} x + b_{r0}) + \sigma'(w_{r0} x + b_{r0}) (w_r x + b_r) \right). \quad (2.5)$$
Similarly to $N_t$, we can also define $P_t$ with parameters $\theta_t$. They observe that when the network is highly overparameterized, i.e., the network size $m$ is sufficiently large compared to $C_h$, and the learning rate is small, i.e., $\eta = O(1/m)$, the SGD iterates applied to $L(N_t)$ and $L(P_t)$ remain close throughout. Moreover, the problem of optimizing $L(P)$ is convex in $\theta$ and thus can be analyzed with existing methods. They also show an approximation theorem stating that with high probability there are neural network parameters $\theta^*$, close to the initial parameters $\theta_0$, such that the pseudo network with parameters $\theta^*$ is close to the target function. Together with the analysis of SGD, this shows that the pseudo network, and hence the neural network too, achieves small training loss. Then, by a Rademacher complexity argument, they show that the neural network after $T = O(C_h/\epsilon^2)$ time steps has population loss within $\epsilon$ of the optimal loss, thus obtaining a generalization result. We will now describe how to obtain neural networks representing monotonically increasing functions using the two different methods mentioned earlier, namely CNFs and UNFs.
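The coupling between the neural network and the pseudo network of (2.5) can be observed numerically. The following sketch (initialization scales and sizes are our own choices for illustration) evaluates both at the same small offsets $(w_r, b_r)$ and checks that they are close:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
drelu = lambda z: (z >= 0).astype(float)   # a subgradient of ReLU

def N(x, a0, w0, b0, dw, db):
    # the actual one-hidden-layer network, evaluated at offset weights
    return np.sum(a0 * relu((w0 + dw) * x + b0 + db))

def P(x, a0, w0, b0, dw, db):
    # pseudo network: first-order Taylor expansion of the activation
    # around the random initialization (w0, b0); linear in (dw, db)
    z0 = w0 * x + b0
    return np.sum(a0 * (relu(z0) + drelu(z0) * (dw * x + db)))

rng = np.random.default_rng(1)
m = 10000
a0 = rng.normal(0.0, 0.01, m)
w0 = rng.normal(0.0, 1.0 / np.sqrt(m), m)
b0 = rng.normal(0.0, 1.0 / np.sqrt(m), m)
dw = rng.normal(0.0, 1e-3, m)   # small offsets from initialization
db = rng.normal(0.0, 1e-3, m)
x = 0.3
# for small offsets, the two models differ only on the few neurons
# whose preactivation changes sign, so the gap is tiny
assert abs(N(x, a0, w0, b0, dw, db) - P(x, a0, w0, b0, dw, db)) < 1e-2
```

This is the mechanism behind the coupling argument: with large $m$ and small offsets, only a vanishing fraction of neurons change their activation pattern.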

2.1. CONSTRAINED NORMALIZING FLOW

Note that if we have $a_{r0} \ge 0$ and $w_{r0} + w_r \ge 0$ for all $r$, then the function represented by the neural network is monotonically increasing. We can ensure this positivity constraint by replacing $a_{r0}$ and $w_{r0} + w_r$ by functions of them that take on only positive values. For example, the function $x \mapsto x^2$ would give us the neural network $N(x) = \sum_{r=1}^m a_{r0}^2\, \rho\left( (w_{r0} + w_r)^2 x + b_{r0} + b_r \right)$. Note that $a_{r0}$, $w_{r0} + w_r$ and $b_{r0} + b_r$ have no constraints, and so this network can be trained using standard gradient-based algorithms. But first we need to specify the (monotone) activation $\rho$. Let $\sigma(x) = x\, \mathbb{I}[x \ge 0]$ denote the ReLU activation. If we choose $\rho = \sigma$, then in (2.3) we have
$$\log f'(x) = \log \frac{\partial N(x)}{\partial x} = \log \sum_{r=1}^m a_{r0}^2 (w_{r0} + w_r)^2\, \mathbb{I}\left[ (w_{r0} + w_r)^2 x + b_{r0} + b_r \ge 0 \right].$$
This is a discontinuous function in $x$ as well as in $w_r$ and $b_r$. Gradient-based optimization algorithms are not applicable to problems with discontinuous objectives, and indeed this is reflected in the experimental failure of such models to learn the distribution. By the same argument, any activation that has a discontinuous derivative is not admissible. Activations which have a continuous derivative but are convex (e.g., $\mathrm{ELU}(x)$, given by $e^x - 1$ for $x < 0$ and $x$ for $x \ge 0$) also cannot be used, because then $N(x)$ is also a convex function of $x$, which need not be the case for the optimal $f$. The oft-used activation $\tanh$ does not suffer from either of these defects. The pseudo network with activation $\tanh$ is given by
$$P(x) = \sum_{r=1}^m a_{r0}^2 \left( \tanh(w_{r0}^2 x + b_{r0}) + \tanh'(w_{r0}^2 x + b_{r0}) \left( (w_r^2 + 2 w_{r0} w_r) x + b_r \right) \right).$$
Note that $P(x)$ is not linear in the parameters $\theta$. Hence, it is not obvious that the loss function for the pseudo network will remain convex in the parameters; indeed, non-convexity can be confirmed in experiments. A similar situation arises for the exponential parameterization instead of the square. To overcome the non-convexity issue, we propose another formulation for constrained normalizing flows.
Here we retain the form of the neural network as in (2.4), but ensure the constraints $a_{r0} \ge 0$ and $w_{r0} \ge 0$ by the choice of the initialization distribution, and $w_{r0} + w_r \ge \epsilon$ by using projected gradient descent for optimization:
$$N(x) = \sum_{r=1}^m a_{r0} \tanh\left( (w_{r0} + w_r) x + (b_r + b_{r0}) \right), \quad \text{with constraints } w_{r0} + w_r \ge \epsilon \text{ for all } r.$$
Here, $\epsilon > 0$ is a small constant ensuring strict monotonicity of $N(x)$. Note that the constraints in this formulation are simple and easy to use in practice. The pseudo network in this formulation is
$$P(x) = \sum_{r=1}^m a_{r0} \left( \tanh(w_{r0} x + b_{r0}) + \tanh'(w_{r0} x + b_{r0}) (w_r x + b_r) \right), \quad \text{with constraints } w_{r0} + w_r \ge \epsilon \text{ for all } r.$$
$P(x)$ is linear in $\theta$, therefore the objective function is also convex in $\theta$. Note that $P(x)$ need not be forced to remain monotone using constraints: if $N(x)$ and $P(x)$ are sufficiently close and $N(x)$ is strictly monotone with not too small $\min_x \frac{\partial N(x)}{\partial x}$, then we get monotonicity of $P(x)$. Next, we point out that this formulation has a problem in the approximation of an arbitrary target function by a pseudo network. We decompose $P(x)$ into two parts: $P(x) = P_c(x) + P_\ell(x)$, where $P_c(x) = \sum_{r=1}^m a_{r0} \tanh(w_{r0} x + b_{r0})$ and $P_\ell(x) = \sum_{r=1}^m a_{r0} \tanh'(w_{r0} x + b_{r0}) (w_r x + b_r)$. Note that $P_c(x)$ depends only on the initialization and not on $w_r$ and $b_r$. Hence it cannot approximate the target function after training, so $P_\ell(x)$ needs to approximate the target function with $P_c(x)$ subtracted. We now show that $P_\ell(x)$ cannot approximate "sufficiently non-linear" functions. The initialization distribution for $w_{r0}$ is the half-normal distribution obtained from the normal distribution with mean zero and variance $\frac{1}{m}$, i.e., $w_{r0} = |X|$ where $X$ follows that normal distribution. The bias term $b_{r0}$ follows the normal distribution with mean $0$ and variance $\frac{1}{m}$. Under this initialization, $w_{r0}$ and $|b_{r0}|$ are $O\left( \frac{\sqrt{\log m}}{\sqrt{m}} \right)$ with high probability; therefore, $|w_{r0} x + b_{r0}|$ is $O\left( \frac{\sqrt{\log m}}{\sqrt{m}} \right)$.
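The projection step of projected SGD for the constraint $w_{r0} + w_r \ge \epsilon$ is a coordinate-wise clamp, which is why it adds essentially no cost over plain SGD. A minimal sketch (names and the example values are ours):

```python
import numpy as np

EPS = 1e-3   # the small constant epsilon keeping N strictly monotone

def project(w0, w):
    # Projection step of projected SGD: enforce w0 + w >= EPS for all r.
    # The constraint set is a product of half-lines, so Euclidean
    # projection decomposes into an independent clamp per coordinate.
    return np.maximum(w, EPS - w0)

w0 = np.array([0.5, 0.2, 0.05])      # (fixed) random initialization
w = np.array([-0.6, 0.1, -0.1])      # a raw SGD step may violate the constraint
w = project(w0, w)
assert np.all(w0 + w >= EPS)
```

After each SGD update on $(W, B)$, applying `project` restores feasibility exactly, so the iterates always define a strictly monotone network.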
Using the fact that $\tanh'(y) \approx 1$ for small $y$, we get that $\tanh'(w_{r0} x + b_{r0}) \approx 1$ for sufficiently large $m$. In that case, $P_\ell(x)$ becomes a linear function of $x$ and cannot approximate sufficiently non-linear functions. Note that this issue does not arise in the pseudo network with ReLU activation, because the derivative of ReLU is discontinuous at $0$; but, as described earlier, for CNFs activations need to have a continuous derivative. The same approximation issue arises for all activations with a continuous derivative. Using other variances for the initialization leads to problems in other parts of the proof. The problem remains if we use normal initialization of $w_{r0}$ and $b_{r0}$ with variance $o\left( \frac{1}{\log m} \right)$. For normal initialization of $w_{r0}$ and $b_{r0}$ with variance $\Omega\left( \frac{1}{\log m} \right)$ and $O(1)$, successful training of CNFs to small training error can lose the coupling between the neural network $N(x)$ and the pseudo network $P(x)$. Please see Appendix F for more details. A generalization argument for activations with continuous derivatives is not known even in the supervised case; therefore, we do not pursue the theoretical analysis of constrained normalizing flows further. However, we show the effect of overparameterization for constrained normalizing flows with $\tanh$ activation in experiments (Section 3).
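The claim that $\tanh'(w_{r0} x + b_{r0}) \approx 1$ under this initialization is easy to verify numerically. The following sketch (the network size and the fixed input are our own choices) samples the half-normal/normal initialization and checks how close the derivative is to 1:

```python
import numpy as np

# tanh'(y) = 1 - tanh(y)^2; for |y| = O(sqrt(log m / m)) this is 1 - O(log m / m)
rng = np.random.default_rng(2)
m = 100000
w0 = np.abs(rng.normal(0.0, 1.0 / np.sqrt(m), m))   # half-normal, variance 1/m
b0 = rng.normal(0.0, 1.0 / np.sqrt(m), m)           # normal, variance 1/m
x = 0.7                                             # any fixed |x| <= 1
dtanh = 1.0 - np.tanh(w0 * x + b0) ** 2
# every neuron's activation derivative is within 1e-3 of 1, so the
# trainable part of the pseudo network is essentially linear in x
assert np.all(dtanh > 1.0 - 1e-3)
```

With all derivative factors indistinguishable from 1, the trainable part collapses to a linear function of $x$, which is the approximation barrier described above.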

2.2. UNCONSTRAINED NORMALIZING FLOW

Unlike the constrained case, where we modeled $f(x)$ using a neural network $N(x)$, here we model $f'(x)$ using a neural network. Then we have $f(x) = \int_{-1}^x f'(u)\, du$. While this cannot be computed exactly, a good approximation can be obtained via numerical integration, also known as numerical quadrature, of $f'(x)$. The strict monotonicity of $f$ is achieved by ensuring that $f'(x)$ is always positive. To this end, a suitable nonlinearity is applied on top of the neural network: $f'(x) = \phi(N(x))$, where $N(x)$ is as in (2.4) with $\rho = \sigma = \mathrm{ReLU}$, and $\phi$ is the function $\mathrm{ELU} + 1$ given by $\phi(x) = e^x\, \mathbb{I}[x < 0] + (x + 1)\, \mathbb{I}[x \ge 0]$. Thus $\phi(x) > 0$ for all $x \in \mathbb{R}$, which means that $f'(x) > 0$ for all $x$. Although this was the only property of $\mathrm{ELU} + 1$ mentioned by Wehenkel & Louppe (2019), it turns out to have several other properties which we exploit in our proof: it is 1-Lipschitz monotone increasing, and its derivative is bounded from above by $1$. We denote by $\hat{f}(x)$ the estimate of $f(x) = \int_{-1}^x f'(u)\, du$ obtained from $f'(x)$ via quadrature: $\hat{f}(x) = \sum_{i=1}^Q q_i f'(\tau_i(x))$. Here $Q$ is the number of quadrature points $\tau_1(x), \ldots, \tau_Q(x)$, and $q_1, \ldots, q_Q \in \mathbb{R}$ are the corresponding coefficients. Wehenkel & Louppe (2019) use Clenshaw–Curtis quadrature, where the coefficients $q_i$ can be negative. We will use the simple rectangle quadrature, which arises in Riemann integration and uses only positive coefficients:
$$\hat{f}(x) = \Delta_x \left( f'(-1 + \Delta_x) + f'(-1 + 2\Delta_x) + \cdots + f'(x) \right), \quad \text{where } \Delta_x = \frac{x+1}{Q}.$$
It is known (see e.g. Chapter 5 in Atkinson (1989) for related results) that
$$\left| \hat{f}(x) - f(x) \right| \le \frac{M (x+1)^2}{2Q}, \quad \text{where } M = \max_{u \in [-1, x]} |f''(u)|.$$
Compared to Clenshaw–Curtis quadrature, the rectangle quadrature requires more points for similar accuracy (in our experiments, about double). However, we use it because all the coefficients are positive, which helps make the problem of minimizing the loss a convex optimization problem.
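The rectangle quadrature and its error bound can be sketched as follows (the test integrand $f'(u) = e^u$ is our own choice; any $f'$ with bounded derivative works):

```python
import numpy as np

def rect_quadrature(f_prime, x, Q):
    """Right-endpoint rectangle rule for the integral of f' over [-1, x].

    All Q coefficients equal dx > 0; this positivity is what keeps the
    UNF training objective convex in the pseudo-network parameters.
    """
    dx = (x + 1.0) / Q
    pts = -1.0 + dx * np.arange(1, Q + 1)   # -1+dx, -1+2dx, ..., x
    return dx * np.sum(f_prime(pts))

# sanity check with f'(u) = e^u: error <= M (x+1)^2 / (2Q), M = max |f''| on [-1, x]
x, Q = 0.5, 100
approx = rect_quadrature(np.exp, x, Q)
exact = np.exp(x) - np.exp(-1.0)            # integral of e^u over [-1, 0.5]
assert abs(approx - exact) <= np.exp(0.5) * (x + 1.0) ** 2 / (2 * Q)
```

Doubling $Q$ halves the bound, which matches the observation that rectangle quadrature needs roughly twice the points of Clenshaw–Curtis for similar accuracy.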
Instead of using $f$, to which we do not have access, we use $\hat{f}$ in the loss function, denoting it $L(\hat{f}, x)$ for the standard exponential as the base distribution, to write $L(\hat{f}, x) = \hat{f}(x) - \log f'(x)$ and $L(\hat{f}, S) = \frac{1}{n} \sum_{x \in S} L(\hat{f}, x)$. The loss $L_G(\hat{f}, x)$ for the standard Gaussian as the base distribution is defined similarly. Let $X$ be a random variable with density supported on $[-1, 1]$. Let the base distribution be the standard exponential, so $Z$ is a random variable with the standard exponential distribution. Let $F^* : \mathbb{R} \to \mathbb{R}$ be continuous monotone increasing such that $F^{*-1}(Z)$ has the same distribution as $X$. Let $S = \{x_1, \ldots, x_n\}$ be a set of i.i.d. samples of $X$. Following Allen-Zhu et al. (2019), we initialize $a_{r0} \sim \mathcal{N}(0, \epsilon_a^2)$, $w_{r0} \sim \mathcal{N}\left(0, \frac{1}{m}\right)$ and $b_{r0} \sim \mathcal{N}\left(0, \frac{1}{m}\right)$, where $\epsilon_a > 0$ is a small constant to be set later. The SGD updates are given by $\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\hat{f}_t, x_t)$, where $f'_t(x) = \phi(N_t(x))$, and $x_t \in S$ is chosen uniformly at random at each step. We can now state our main result.
Theorem 2.1 (informal statement of Theorem E.1; loss function is close to optimal). For any $\epsilon > 0$ and for any target function $F^*$ with finite second-order derivative, hidden layer size $m \ge \frac{C_1(F^*)}{\epsilon^2}$, number of samples $n \ge \frac{C_2(F^*)}{\epsilon^2}$ and number of quadrature points $Q \ge \frac{C_3(F^*)}{\epsilon}$, where $C_1(\cdot), C_2(\cdot), C_3(\cdot)$ are complexity measures, with probability at least $0.9$ we have
$$\mathbb{E}_{\mathrm{sgd}} \left[ \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}_{x \sim \mathcal{D}}\, L(\hat{f}_t, x) \right] - \mathbb{E}_{x \sim \mathcal{D}} \left[ L(F^*, x) \right] = O(\epsilon).$$
The complexity functions in the above statement have natural interpretations in terms of how fast the function oscillates. Now recall that $\mathrm{KL}\left( p_{F^*,Z} \| p_{f_t,Z} \right) = \mathbb{E}_X \left[ \log \frac{p_{F^*,Z}(X)}{p_{f_t,Z}(X)} \right]$, which gives $\mathbb{E}_{\mathrm{sgd}} \left[ \frac{1}{T} \sum_{t=0}^{T-1} \mathrm{KL}\left( p_{F^*,Z} \| p_{f_t,Z} \right) \right] = O(\epsilon)$. Recall that $p_{f,Z}(x)$ is the probability density of $f^{-1}(Z)$. Using Pinsker's inequality, we can also bound the total variation distance between the learned distribution $p_{f_t,Z}$ and the data distribution $p_{F^*,Z}$.
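Spelling out the total variation bound (our own worked step, not stated explicitly in the text): Pinsker's inequality gives $\mathrm{TV}(p, q) \le \sqrt{\frac{1}{2}\mathrm{KL}(p \| q)}$, and combining it with the KL bound above via Jensen's inequality (the square root is concave) yields

```latex
\mathbb{E}_{\mathrm{sgd}}\left[\frac{1}{T}\sum_{t=0}^{T-1} \mathrm{TV}\left(p_{F^*,Z},\, p_{f_t,Z}\right)\right]
\le \mathbb{E}_{\mathrm{sgd}}\left[\frac{1}{T}\sum_{t=0}^{T-1} \sqrt{\tfrac{1}{2}\,\mathrm{KL}\left(p_{F^*,Z}\,\|\,p_{f_t,Z}\right)}\right]
\le \sqrt{\tfrac{1}{2}\,\mathbb{E}_{\mathrm{sgd}}\left[\frac{1}{T}\sum_{t=0}^{T-1} \mathrm{KL}\left(p_{F^*,Z}\,\|\,p_{f_t,Z}\right)\right]}
= O(\sqrt{\epsilon}).
```

Thus an $O(\epsilon)$ average KL guarantee translates into an $O(\sqrt{\epsilon})$ average total variation guarantee.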
Define the pseudo network $\hat{g}(x)$, which acts as a proxy for $\hat{f}(x)$, via $g'(x) = \phi(P(x))$. Note that our definition of the pseudo network is not the most straightforward version: $\hat{g}(x)$ is not a linear approximation of $\hat{f}(x)$. As in Allen-Zhu et al. (2019), we begin by showing the existence of a pseudo network close to the target function. However, for this we cannot use the approximation lemma in Allen-Zhu et al. (2019), as it seems to require dimension at least 2. We use the recent result of Ji et al. (2020) instead (Lemma B.1). The presence of both $f'$ and $\hat{f}$, and other differences in the loss function, lead to new difficulties in the analysis compared to the supervised case. Due to lack of space, we refer the reader to the appendices for the full proof.

3.1. RESULTS FOR CONSTRAINED NORMALIZING FLOW

In Sec. 2.1, we suggested that high overparameterization may adversely affect training for constrained normalizing flows. We now give experimental evidence for this. In Fig. 1, we see that as we increase the learning rate, training becomes more stable for larger $m$. Note that for learning rate 0.025, the constrained normalizing flow with $m = 1600$ does not learn anything, due to the small learning rate. We observe that the $L_2$-norms of $W_t$ and $B_t$ for $m = 6400$ are at least as large as those for $m = 1600$. On both datasets, as we increase the learning rate, the $L_2$-norm of $B_t$ increases and learning of the constrained normalizing flow becomes more stable. These observations support our claim in Sec. 2.1 that for learning and approximation with overparameterized constrained normalizing flows, the neural networks need large $L_2$-norms of $W_t$ and $B_t$.

4. CONCLUSION

In this paper, we gave the first theoretical analysis of normalizing flows in the simple but instructive univariate case. We gave empirical and theoretical evidence that overparametrized networks are unlikely to be useful for CNFs. By contrast, for UNFs, overparametrization does not hurt, and we can adapt techniques from supervised learning to analyze two-layer (i.e., one-hidden-layer) networks. Our technical adaptations and NF variants may find use in future work. Our work raises a number of open problems: (1) We made two changes to the unconstrained flow architecture of Wehenkel & Louppe (2019). An obvious open problem is an analysis of the original architecture, or of a variant with at most one of our changes. While the exponential distribution works well as the base distribution, can we also analyze the Gaussian distribution? Similarly, Clenshaw–Curtis quadrature instead of the simple rectangle quadrature? These problems seem tractable, but are also likely to require interesting new techniques, as the optimization becomes non-convex. That would get us one step closer to the architectures used in practice. (2) Analysis of constrained normalizing flows. This is likely to be difficult because, as our results suggest, one needs networks that are not highly overparametrized; this regime is not well understood even in the supervised case. (3) Finally, analysis of normalizing flows in the multidimensional case. Our 1D results bring into focus potential difficulties: all unconstrained architectures seem to require more than one hidden layer, which poses difficult challenges even in the supervised case. For CNFs, it is possible to design an architecture with one hidden layer, but as we have seen in our analysis of CNFs, that is challenging too.

A NOTATIONS

We denote by (α, β) the concatenation of two vectors α and β. For any two vectors α and β, α ⊙ β denotes their element-wise product. The parameter vector θ ∈ R^{2m} of the neural network is the concatenation of W = (w_1, w_2, ..., w_m) ∈ R^m and B = (b_1, b_2, ..., b_m) ∈ R^m, i.e., θ = (W, B). Similarly, θ_t = (W_t, B_t), where W_t = (w_1^t, w_2^t, ..., w_m^t) and B_t = (b_1^t, b_2^t, ..., b_m^t), and A_0 = (a_{10}, a_{20}, ..., a_{r0}, ..., a_{m0}). We denote 1 = (1, 1, ..., 1) ∈ R^m. We use big-O notation to hide constants, and log denotes the natural logarithm. [n] denotes the set {1, 2, ..., n}.

B EXISTENCE

This section contains a proof that shows existence of a pseudo network whose loss closely approximates the loss of the target function. Lemma B.1. For every positive function F * , for every x in the radius of 1 (i.e. |x| ≤ 1), there exist a function h(w r0 , b r0 ) : R 2 → [-U h , U h ] such that φ -1 (F * (x)) -E wr0,br0∼N (0,1) [h(w r0 , b r0 )I [w r0 x + b r0 ≥ 0]] ≤ ω φ -1 (F * ) (δ) where U h is given by U h = Õ φ -1 (F * ) |δ 5 L1 δ 10 (ω φ -1 (F * ) (δ)) 4 (B.1) Proof. We use a result from Ji et al. (2020) to prove the lemma.  ω ψ (δ) = sup{ψ(x) -ψ(x ) : max{|x| , |x |} ≤ 1 + δ, |x -x | ≤ δ} ψ |δ (x) :=ψ(x)I [|x| ≤ 1 + δ] ψ |δ,α :=ψ |δ * G α α := δ 1 + 2 log (2M/ω ψ (δ)) = Õ(δ) M := sup |x|≤1+δ |ψ(x)| β := 1 2πα 2 T r (w r0 , b r0 ) :=2 ψ |δ,α (0) + ψ|δ,α (v) cos 2π θ ψ |δ,α (v) -v dv + 2π 2πβ 2 ψ|δ (βw r0 ) e (b r0 ) 2 2 sin 2π θ ψ |δ,α (βw r0 ) -b r0 I [|b r0 | ≤ w r0 ≤ r] where * denotes convolution operation, G α denotes Gaussian with mean 0 and variance α 2 . Note that Õ hides logarithmic dependency of complexity measure of function ψ. ψ|δ,α denotes magnitude of fourier transform of ψ |δ,α and θ ψ |δ,α denotes phase of fourier transform. Then, sup |x|≤1 ψ(x) -E wr0,br0∼N (0,1) [T r (w r0 , b r0 )I [w r0 x + b r0 ≥ 0]] ≤ ω ψ (δ) (B.2) The upper bound of T r (w r0 , b r0 ) is given by sup wr0,br0 T r (w r0 , b r0 ) = Õ ψ |δ 5 L1 δ 10 (ω ψ (δ)) 4 = U T (B.3) Using Result B.1 for φ -1 (F * (x)) function, denoting T r (w r0 , b r0 ) for φ -1 (F * (x)) function as h(w r0 , b r0 ), we get φ -1 (F * (x)) -E wr0,br0∼N (0,1) [h(w r0 , b r0 )I [w r0 x + b r0 ≥ 0]] ≤ ω φ -1 (F * ) (δ) with following upper bound on h(w r0 , b r0 ). 
sup wr0,br0 h(w r0 , b r0 ) ≤ Õ φ -1 (F * ) |δ 5 L1 δ 10 (ω φ -1 (F * ) (δ)) 4 = U h Divide pseudo network P (x) into 2 parts: P c (x), first part of pseudo network is constant and time-independent and P (x), second part of pseudo network is linear in w r and b r P (x) = P c (x) + P (x) where P c (x) = m r=1 a r0 (w r0 x + b r0 ) I [w r0 x + b r0 ≥ 0] P (x) = m r=1 a r0 (w r x + b r ) I [w r0 x + b r0 ≥ 0] Lemma B.2. (Approximating target function using P (x)) For every positive function F * and for every ∈ (0, 1), with at least 1 -1 c1 -exp - 2 m 128c 2 1 U 2 h log m probability over random initialization, there exist θ * such that we get following inequality for all x ∈ [-1, 1] and some fixed positive constant c 1 > 1. |φ(P * (x)) -F * (x)| ≤ ω φ -1 (F * ) (δ) + and upper bound L ∞ norm of parameters is given by θ * ∞ ≤ U h √ π √ 2m a Proof. Define w * r and b * r as w * r = 0 b * r = sign (a r0 ) √ π m a √ 2 h( √ mw r0 , √ mb r0 ) (B.4) Using w * r and b * r , E ar0∼N (0, 2 a ),wr0∼N (0, 1 m ),br0∼N (0, 1 m ) [P * (x)] = E ar0∼N (0, 2 a ),wr0∼N (0, 1 m ),br0∼N (0, 1 m ) m r=1 a r0 (w * r x + b * r )I [w r0 x + b r0 ≥ 0] = E ar0∼N (0, 2 a ),wr0∼N (0, 1 m ),br0∼N (0, 1 m ) a r0 sign (a r0 ) √ π a √ 2 h( √ mw r0 , √ mb r0 )I [w r0 x + b r0 ≥ 0] (i) = E wr0∼N (0, 1 m ),br0∼N (0, 1 m ) h( √ mw r0 , √ mb r0 )I √ m (w r0 x + b r0 ) ≥ 0 where equality (i) follows from Fact H.2 and homogeneity of indicator function. Using Lemma B.1,  E ar0∼N (0, 2 a ),wr0∼N (0, 1 m ),br0∼N (0, 1 m ) [P * (x)] -φ -1 (F * (x)) = E wr0∼N (0, 1 m ),br0∼N (0, 1 m ) h( √ mw r0 , √ mb r0 )I √ m (w r0 x + b r0 ) ≥ 0 -φ -1 (F * (x)) ≤ ω φ -1 (F * ) (δ) (B. E [h] = 2 m E ar0,wr0,br0,ξr sup x m m r=1 ξ i (w * r x + b * r ) I [w r0 x + b r0 ≥ 0] where ξ 1 , ξ 2 , . . . , ξ m are independent Rademacher random variables. 
E ar0,wr0,br0 [h] ≤ 2 m E ar0,wr0,br0,ξr sup x m m r=1 ξ i a r0 (w * r x + b * r ) I [w r0 x + b r0 ≥ 0] ≤ 2 m E ar0,wr0,br0,ξr sup x m m r=1 ξ i a r0 (w * r x + b * r ) I [w r0 x + b r0 ≥ 0] ≤ 8c 1 √ log mU h m E ar0,wr0,br0,ξr sup x m r=1 ξ i I [w r0 x + b r0 ≥ 0] One can show that 1 m E ar0,wr0,br0,ξr sup x m r=1 ξ i I [w r0 x + b r0 ≥ 0] ≤ 2 log m m Using this relation, we get E ar0,wr0,br0 [h] ≤ 16c 1 U h log m √ m Using Mcdiarmid's inequality, with at least 1 -1 c1 -exp - 2 m 128c 2 1 U 2 h log m , we have |P * (x) -E ar0,wr0,br0 [P * (x)]| = h =≤ 2 + 16c 1 U h log m √ m (i) ≤ (B.6) where inequality (i) follows from our choice of m in lemma D.2. Using eq.(B.5), we get P * (x) -φ -1 (F * (x)) ≤ ω φ -1 (F * ) (δ) + (B.7) Using 1-Lipschitzness of φ, we get |φ(P * (x)) -F * (x)| = φ(P * (x)) -φ φ -1 (F * (x)) ≤ P * (x) -φ -1 (F * (x)) ≤ ω φ -1 (F * ) (δ) + The upper bound on norm of θ * ∞ is given by the following equation. θ * ∞ ≤ U h √ π √ 2m a Corollary B.1. (Approximating target network using P (x)) For every positive function F * and for every ∈ (0, 1), with at least 0.99 -1 c1 -1 c6 -1 c7 -exp - 2 m 128c 2 1 U 2 h log m probability over random initialization, there exists θ * such that we have following inequality for all x ∈ [-1, 1] and some fixed positive constants c 1 > 1, c 6 > 1 and c 7 > 1. |φ (P * (x)) -F * (x)| ≤ 16c 1 (c 6 + c 7 ) a log m + ω φ -1 (F * ) (δ) + and upper bound on L ∞ norm of parameters θ * is given by  θ * ∞ ≤ U h √ π √ 2m a Proof. 
a r0 w r0 ≥ t   ≤ exp - 2t 2 m m 2c 1 a √ 2 log m 2 2c 6 √ 2 log m 2 = exp - t 2 32c 2 1 c 2 6 2 a (log m) 2 Taking t = 16c 1 c 6 a (log m), with at least probability 0.999 -1 c1 -1 c6 , we have m r=1 a r0 w r0 ≤ 16c 1 c 6 a (log m) and similarly, we will get that with at least 0.999 -1 c1 -1 c7 probability, m r=1 a r0 w r0 ≤ 16c 1 c 7 a (log m) we will get that at least 0.999 -1 c1 -1 c6 -1 c7 probability, we have m r=1 a r0 w r0 I [w r0 x + b r0 ≥ 0] ≤ 16c 1 c 6 a (log m) (B.8) m r=1 a r0 b r0 I [w r0 x + b r0 ≥ 0] ≤ 16c 1 c 7 a (log m) Using these relations, we get that with at least 0.99 -1 c1 -1 c6 -1 c7 probability, m r=1 a r0 (w r0 x + b r0 ) I [w r0 x + b r0 ≥ 0] ≤ 16c 1 (c 6 + c 7 ) a log m (B.9) Using above inequality, we get |φ (P * (x)) -φ(P * (x))| ≤ |P * (x) -P * (x)| ≤ 16c 1 (c 6 + c 7 ) a log m Using lemma B.2, with at least 0.99 -1 c1 -1 c6 -1 c7 -exp - 2 m 128c 2 1 U 2 h log m probability, |φ (P * (x)) -F * (x)| ≤ |φ (P * (x)) -φ(P * (x))| + |φ(P * (x)) -F * (x)| ≤ 16c 1 (c 6 + c 7 ) a log m + ω φ -1 (F * ) (δ) + Lemma B.3. ( loss) For every positive function F * and for every ∈ (0, 1), with at least 0.99 -1 c1 -1 c6 -1 c7 -exp - 2 m 128c 2 1 U 2 h log m probability over random initialization, there exist θ * such that loss of pseudo network with θ * parameters is close to that of the target function for all x ∈ [-1, 1] and for some fixed positive constants c 1 > 1, c 6 > 1 and c 7 > 1. L (φ (P * ) , x) -L (F * , x) ≤ 3 16c 1 (c 6 + c 7 ) a log m + ω φ -1 (F * ) (δ) + Proof. 
L (φ (P * ) , x) -L (F * , x) ≤ Q i=1 ∆ x φ (P * (τ i (x))) - Q i=1 ∆ x F * (τ i (x)) + |log (φ (P * (x))) -log (F * (x))| (i) ≤ 2 16c 1 (c 6 + c 7 ) a log m + ω φ -1 (F * ) (δ) + + P * (x) -φ -1 (F * (x)) ≤ 2 16c 1 (c 6 + c 7 ) a log m + ω φ -1 (F * ) (δ) + + |P * c (x)| + P * (x) -φ -1 (F * (x)) (ii) ≤ 3 16c 1 (c 6 + c 7 ) a log m + ω φ -1 (F * ) (δ) + where inequality (i) follows from Corollary B.1 with at least 0.99 -1 c1 -1 c6 -1 c7 - exp - 2 m 128c 2 1 U 2 h log m probability. Inequality (ii) uses Eq.(B.7) and Eq.(B.9).
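The pseudo network used throughout this section is the linearization of the ReLU network at its initialization: the indicators I[w_{r0} x + b_{r0} ≥ 0] are frozen at their initial values while the weight movements (w_r, b_r) enter linearly. The following is a minimal numerical sketch of the two properties that make this object useful, exact linearity in the movement and closeness to the true network early in training. The width, movement scale, and tolerances are illustrative choices, not the values from the proofs.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5000                                 # width (illustrative)
w0 = rng.normal(0, 1 / np.sqrt(m), m)    # w_{r0} ~ N(0, 1/m)
b0 = rng.normal(0, 1 / np.sqrt(m), m)    # b_{r0} ~ N(0, 1/m)
a0 = rng.choice([-1.0, 1.0], m)          # symmetric output weights a_{r0}

def N(x, dw, db):
    """Two-layer ReLU network evaluated at weight movement (dw, db)."""
    pre = (w0 + dw) * x + (b0 + db)
    return np.sum(a0 * np.maximum(pre, 0.0))

def P(x, dw, db):
    """Pseudo network: same linear part, activation pattern frozen at init."""
    act0 = (w0 * x + b0 >= 0).astype(float)
    return np.sum(a0 * ((w0 + dw) * x + (b0 + db)) * act0)

x = 0.7
dw = rng.normal(0, 1e-4, m)              # small movement, as early in SGD
db = rng.normal(0, 1e-4, m)

# P is exactly affine in the movement (dw, db) ...
lin_err = abs(P(x, 2 * dw, 2 * db) - P(x, 0 * dw, 0 * db)
              - 2 * (P(x, dw, db) - P(x, 0 * dw, 0 * db)))

# ... and stays close to N, since only neurons near their kink can flip,
# and those contribute little.
coupling_err = abs(N(x, dw, db) - P(x, dw, db))
```

Because the indicators are frozen, the linearity check holds to floating-point precision, while the coupling gap is small exactly because pattern flips are rare for small movements, which is the content of the coupling lemmas in Section C.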

C COUPLING

In this section, we prove that, for random initialization, the gradients of the loss of pseudo network closely approximate the gradients of the loss of the target function. In other words, we show coupling of their gradient-based optimizations. Define λ 1 as λ 1 = sup t∈[T ],r∈[m],w t r ,b t r ,|x|≤1 φ (N t (x)) φ(N t (x)) (C.1) We get following find upper bound on λ 1 . λ 1 = sup t∈[T ],r∈[m],w t r ,b t r ,|x|≤1 φ (N t (x)) φ(N t (x)) = sup t∈[T ],r∈[m],w t r ,b t r ,|x|≤1 exp (N t (x)) I [N t (x) < 0] + I [N t (x) ≥ 0] exp (N t (x)) I [N t (x) < 0] + (N t (x) + 1) I [N t (x) ≥ 0] = sup t∈[T ],r∈[m],w t r ,b t r ,|x|≤1 I [N t (x) < 0] + I [N t (x) ≥ 0] N t (x) + 1 = 1 (C.2) Define ∆ as ∆ = 6c 1 a 2 log m (C.3) for some positive constant c 1 > 1. Lemma C.1. (Bound in change in patterns) For every x in 1 radius (|x| ≤ 1) and for every time step t ≥ 1, with probability at least 1 -1 c1 -exp -64(c2-1) 2 η 2 m 2 ∆2 t 2 π over random initialization, for at most c 2 4 √ 2η √ m ∆t √ π fraction of r ∈ [m] I (w r0 + w t r )x + b r0 + b t r ≥ 0 = I [w r0 x + b r0 ≥ 0] for some positive constant c 1 > 1 and c 2 ≥ 1. Proof. Taking derivative of L(f , x) wrt w r , ∂ L(f t , x) ∂w r = Q i=1 ∆ x φ N t (τ i (x)) a r0 σ (w r0 + w t r )τ i (x) + b r0 + b t r τ i (x) + 1 φ(N t (x)) φ (N t (x))a r0 σ (w r0 + w t r )x + b r0 + b t r x ≤ Q i=1 ∆ x φ N t (τ i (x)) a r0 σ (w r0 + w t r )τ i (x) + b r0 + b t r τ i (x) + φ (N t (x)) φ(N t (x)) a r0 σ (w r0 + w t r )x + b r0 + b t r x Using Eq.(C.2), ∆ x ≤ 2 Q , |x| ≤ 1 and |φ (N (x))| ≤ 1 for all x ∈ [-1, 1], we get ∂ L(f t , x) ∂w r ≤ 3 |a r0 | Using Lemma H.2, with at least 1 -1 c1 probability, we get ∂ L(f t , x) ∂w r ≤ ∆ (C.4) where ∆ is defined in Eq.(C.3). 
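The mechanism behind Lemma C.1 is Gaussian anti-concentration: w_{r0} x + b_{r0} has variance at least 1/m, so it lands within ε of the ReLU kink with probability O(ε√m), and only those neurons can change their activation pattern under an ε-perturbation. A small Monte Carlo check of the anti-concentration bound used in the proof; the values of m, x, and ε (which plays the role of 4ηΔ̄t) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
m, x = 100_000, 0.5
eps = 1e-3                               # stands in for 4 * eta * Delta * t

w0 = rng.normal(0, 1 / np.sqrt(m), m)    # w_{r0} ~ N(0, 1/m)
b0 = rng.normal(0, 1 / np.sqrt(m), m)    # b_{r0} ~ N(0, 1/m)
pre = w0 * x + b0                        # ~ N(0, (x**2 + 1)/m)

# Fraction of neurons whose activation pattern can flip under an
# eps-perturbation of the preactivation.
frac = np.mean(np.abs(pre) <= eps)

# Anti-concentration bound from the proof (density of a variance-(1/m)
# Gaussian at 0 times the interval length): sqrt(2) * eps * sqrt(m) / sqrt(pi)
bound = np.sqrt(2) * eps * np.sqrt(m) / np.sqrt(np.pi)
```

The empirical fraction sits below the bound; the slack comes from the true variance (x² + 1)/m being larger than the 1/m used in the bound.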
Using same procedure for b r , we get ∂ L(f t , x) ∂b r = Q i=1 ∆ x φ (N t (τ i (x))) a r0 σ (w r0 + w t r )τ i (x) + b r0 + b t r + 1 φ(N t (x)) φ (N t (x))a r0 σ (w r0 + w t r )x + b r0 + b t r ≤ 3 |a r0 | = ∆ (C.5) Using Eq.(C.4) and Eq.(C.5), we get w t r ≤ η ∆t b t r ≤ η ∆t (C.6) Define H t = {r ∈ [m]| |w r0 x + b r0 | ≥ 4η ∆t} (C.7) For every x with |x| ≤ 1 and for all r ∈ [m], |w t r x + b t r | ≤ 2η ∆t. For all r ∈ H t , we get I [(w r0 + w t r )x + b r0 + b t r ≥ 0] = I [w r0 x + b r0 ≥ 0] . Now, we need to bound the size of H t . We know that for all x ∈ [-1, 1], w r0 x + b r0 is Gaussian with E [w r0 x + b r0 ] = 0 and Var [w r0 x + b r0 ] ≥ 1 m . Using Lemma H.3, we get Pr |w r0 x + b r0 | ≤ 4η ∆t ≤ 4 √ 2η √ m ∆t √ π Under review as a conference paper at ICLR 2021 Using Fact H.1 for H c t (where  H c t = [m]/H t ) for some positive constant c 2 ≥ 1, we get Pr |H c t | ≥ c 2 m 4 √ 2η √ m ∆t √ π ≤ exp   -2m (c 2 -1) 4 √ 2η √ m ∆t √ π 2   ≤ exp - 64(c 2 -1) 2 η 2 m 2 ∆2 t 2 π Pr |H c t | ≤ c 2 m 4 √ 2η √ m ∆t √ π ≥ 1 -exp - 64(1 -c 2 ) 2 η 2 m 2 ∆2 t 2 π Pr |H t | ≥ m 1 -c 2 4 √ 2η √ m ∆t √ π ≥ 1 -exp - 64(1 -c 2 ) 2 η 2 m 2 ∆2 |φ(N t (x)) -φ(P t (x))| ≤ 24c 1 a η ∆t H t c 2 log m Proof. We know that φ is 1-Lipschitz continuous. Using Lipschitz continuity of φ, we get |φ(N t (x)) -φ(P t (x))| ≤ |N t (x) -P t (x)| We bound |N t (x) -P t (x)| as following. |N t (x) -P t (x)| ≤ r∈[m] a r0 (w r0 + w t r )x + b r0 + b t r I (w r0 + w t r )x + b r0 + b t r ≥ 0 - r∈[m] a r0 (w r0 + w t r )x + b r0 + b t r I [w r0 x + b r0 ≥ 0] ≤ r / ∈Ht a r0 (w r0 + w t r )x + b r0 + b t r I (w r0 + w t r )x + b r0 + b t r ≥ 0 -I [w r0 x + b r0 ≥ 0] (i) ≤ H t c 2c 1 a 2 log m 4η ∆t + 2η ∆t (2) ≤24c 1 a η ∆t H t c 2 log m (C.8) where inequality (i) uses Lemma H.2 with at least 1 -1 c1 probability. Corollary C.1. 
(Final bound on difference of f and g ) For every x in 1 radius (|x| ≤ 1) and for every time step t ≥ 1, with at least 1 -1 c1 -exp -64(c2-1) 2 η 2 m 2 ∆2 t 2 π probability over random initialization, function with neural network and function with pseudo network are close for some positive constants c 1 > 1 and c 2 ≥ 1. |φ(N t (x)) -φ(P t (x))| ≤ 192η 2 m 1.5 ∆2 c 1 c 2 a t 2 √ log m √ π (C.9) Proof. Using Lemma C.1 and Lemma C.2, we get |φ(N t (x)) -φ(P t (x))| ≤24c 1 a η ∆t H t c 2 log m (i) ≤24c 1 a η ∆t c 2 m 4 √ 2η √ m ∆t √ π 2 log m ≤ 192ηm 1.5 ∆c 1 c 2 a t √ log m √ π η ∆t = 192η 2 m 1.5 ∆2 c 1 c 2 a t 2 √ log m √ π (C.10) ≤ O(η 2 m 1.5 ∆2 a t 2 log m) where inequality (i) uses Lemma C.1 and the inequality follows with at least 1 -1 c1 - exp -64(c2-1) 2 η 2 m 2 ∆2 t 2 π probability. Define ∆ t np as ∆ t np = 192η 2 m 1.5 ∆2 c 1 c 2 a t 2 √ log m √ π (C.11) Lemma C.3. (Coupling of loss functions) For all x in 1 radius (|x| ≤ 1) and for every time step t ≥ 1, with probability at least 1 -1 c1 -exp -64(c2-1) 2 η 2 m 2 ∆2 t 2 π over random initialization, loss function of neural network and pseudo network are close for some positive constant c 1 > 1 and c 2 ≥ 1. L (f t , x) -L (g t , x) ≤ 3∆ t np Proof. L (f t , x) -L (g t , x) ≤ Q i=1 ∆ x f t (τ i (x)) - Q i=1 ∆ x g t (τ i (x)) + |log (f t (x)) -log (g t (x))| (i) ≤2 sup i∈[Q] |f t (τ i (x)) -g t (τ i (x))| + |N t (x) -P t (x)| (ii) ≤ 3∆ t np where inequality (i) follows from 1-Lipschitz continuity of log (φ(N (x))) with respect to N (x). Inequality (ii) uses Eq.(C.8) and Lemma C.2. Lemma C.4. (Coupling of gradient of functions) For all x in 1 radius (|x| ≤ 1) and for every time step t ≥ 1, with at least 1 -1 c1 probability over random initialization, gradient of derivative of neural network function and derivative of pseudo network function with respect to parameters are close for some positive constant c 1 > 1. ∇ θ f t (x) -∇ θ g t (x) 1 ≤ 4c 1 a m∆ t np + 2 |H c t | 2 log m Proof. 
∇ θ f t (x) -∇ θ g t (x) 1 ≤ φ (N t (x))∇ θ N t (x) -φ (P t (x))∇ θ P t (x) 1 ≤ φ (N t (x))∇ θ N t (x) -φ (P t (x))∇ θ N t (x) 1 + φ (P t (x))∇ θ N t (x) -φ (P t (x))∇ θ P t (x) 1 ≤ |φ (N t (x)) -φ (P t (x))| ∇ θ N t (x) 1 + |φ (P t (x))| ∇ θ N t (x) -∇ θ P t (x) 1 ≤ |N t (x) -P t (x)| ∇ θ N t (x) 1 + ∇ θ N t (x) -∇ θ P t (x) 1 where last inequality follows from 1-Lipschitzness of φ function and φ (x) ≤ 1 for all x such that |x| ≤ 1, t ∈ [T ]. To upper bound ∇ θ N t (x) -∇ θ P t (x) 1 , ∇ θ N t (x) -∇ θ P t (x) 1 ≤ (A 0 , A 0 ) (1x, 1) (I (W 0 + W t )x + B 0 + B t ≥ 0 -I [W 0 x + B 0 ≥ 0] , I (W 0 + W t )x + B 0 + B t ≥ 0 -I [W 0 x + B 0 ≥ 0]) 1 (i) ≤ 8c 1 a 2 log m |H c t | ≤ 8c 1 a |H c t | 2 log m (C.12) The inequality (i) uses property of H t that for all r ∈ H t , I [(w r0 + w t r )x + b r0 + b t r ≥ 0] = I [w r0 x + b r0 ≥ 0]. Using Eq.(C.11) and Eq.(C.12), we get ∇ θ f t (x) -∇ θ g t (x) 1 ≤ |N t (x) -P t (x)| (A 0 , A 0 ) (1x, 1) (I (W 0 + W t )x + B 0 + B t ≥ 0 , I (W 0 + W t )x + B 0 + B t ≥ 0 ) 1 + ∇ θ N t (x) -∇ θ P t (x) 1 ≤ 4c 1 a m∆ t np 2 log m + 8c 1 a |H c t | 2 log m = 4c 1 a m∆ t np + 2 |H c t | 2 log m Lemma C.5. (Coupling of gradient of loss) For all x in 1 radius (|x| ≤ 1) and for every time step t ≥ 1, with probability at least 1 -1 c1 -exp -64(c2-1) 2 η 2 m 2 ∆2 t 2 π over random initialization, gradient of loss function with neural network and loss function with pseudo network are close for some positive constant c 1 > 1 and c 2 ≥ 1. ∇ θ L(f t , x) -∇ θ L(g t , x) 1 ≤ 192ηm 1.5 ∆c 1 c 2 a t √ log m √ π + 16c 1 a m∆ t np 2 log m Proof. 
∇ θ L(f t , x) -∇ θ L(g t , x) 1 ≤ Q i=1 ∆ x ∇ θ f t (τ i (x)) - ∇ θ f t (x) f t (x) - Q i=1 ∆ x ∇ θ g t (τ i (x)) + ∇ θ g t (x) g t (x) 1 ≤ Q i=1 ∆ x ∇ θ f t (τ i (x)) - Q i=1 ∆ x ∇ θ g t (τ i (x)) 1 I + ∇ θ g t (x) g t (x) - ∇ θ f t (x) f t (x) 1 II Proving bound on I, I = Q i=1 ∆ x ∇ θ f t (τ i (x)) - Q i=1 ∆ x ∇ θ g t (τ i (x)) 1 ≤ Q i=1 ∆ x ∇ θ f t (τ i (x)) -∇ θ g t (τ i (x)) 1 (i) ≤ 8c 1 a m∆ t np + 2 |H c t | 2 log m where inequality (i) follows from Lemma C.4. Now, we will bound II, II = ∇ θ g t (x) g t (x) - ∇ θ f t (x) f t (x) 1 = exp (P t (x)) I [P t (x) < 0] + I [P t (x) ≥ 0] exp (P t (x)) I [P t (x) < 0] + (P t (x) + 1) I [P t (x) ≥ 0] ∇ θ P t (x) - exp (N t (x)) I [N t (x) < 0] + I [N t (x) ≥ 0] exp (N t (x)) I [N t (x) < 0] + (N t (x) + 1) I [N t (x) ≥ 0] ∇ θ N t (x) 1 = I [P t (x) < 0] + I [P t (x) ≥ 0] (P t (x) + 1) ∇ θ P t (x) -I [N t (x) < 0] + I [N t (x) ≥ 0] (N t (x) + 1) ∇ θ N t (x) 1 = ∇ θ P t (x) -∇ θ N t (x) 1 I [P t (x) < 0, N t (x) < 0] II1 + ∇ θ P t (x) - ∇ θ N t (x) N t (x) + 1 1 I [P t (x) < 0, N t (x) ≥ 0] II2 + ∇ θ P t (x) P t (x) + 1 -∇ θ N t (x) 1 I [P t (x) ≥ 0, N t (x) < 0] II3 + ∇ θ P t (x) P t (x) + 1 - ∇ θ N t (x) N t (x) + 1 1 I [P t (x) ≥ 0, N t (x) ≥ 0] II4 On simplifying II 2 , we get II 2 ≤ 1 N t (x) + 1 ∇ θ P t (x) -∇ θ N t (x) 1 + N t (x) 1 + N t (x) ∇ θ P t (x) 1 I [P t (x) < 0, N t (x) ≥ 0] ≤ ∇ θ P t (x) -∇ θ N t (x) 1 + ∆ t np ∇ θ P t (x) 1 I [P t (x) < 0, N t (x) ≥ 0] (C.13) Similarly, on simplifying II 3 , we get II 3 ≤ 1 P t (x) + 1 ∇ θ P t (x) -∇ θ N t (x) 1 + P t (x) 1 + P t (x) ∇ θ N t (x) 1 I [P t (x) ≥ 0, N t (x) < 0] ≤ ∇ θ P t (x) -∇ θ N t (x) 1 + ∆ t np ∇ θ N t (x) 1 I [P t (x) ≥ 0, N t (x) < 0] (C.14) On simplifying II 4 , we get II 4 ≤ ∇ θ P t (x) P t (x) + 1 - ∇ θ N t (x) P t (x) + 1 1 + ∇ θ N t (x) P t (x) + 1 - ∇ θ N t (x) N t (x) + 1 1 I [P t (x) ≥ 0, N t (x) ≥ 0] ≤ 1 P t (x) + 1 ∇ θ P t (x) -∇ θ N t (x) 1 + ∇ θ N t (x) 1 ∆ t np (P t (x) + 1) (N t (x) + 1) I [P t (x) ≥ 0, N t (x) ≥ 0] ≤ ∇ θ P t (x) 
-∇ θ N t (x) 1 + ∇ θ N t (x) 1 ∆ t np I [P t (x) ≥ 0, N t (x) ≥ 0] (C.15) Using Eq.(C.13), Eq.(C.14) and Eq.(C.15), we get II = ∇ θ g t (x) g t (x) - ∇ θ f t (x) f t (x) 1 ≤ ∇ θ P t (x) -∇ θ N t (x) 1 + ∇ θ N t (x) 1 ∆ t np I [P t (x) ≥ 0] + ∆ t np ∇ θ P t (x) 1 I [P t (x) < 0, N t (x) ≥ 0] Using Eq.(C.12), we get II ≤ 8c 1 a |H c t | 2 log m + ∆ t np ∇ θ N t (x) 1 + ∇ θ P t (x) 1 ≤ 8c 1 a |H c t | 2 log m + ∆ t np (A 0 , A 0 ) (1x, 1) (I [W 0 x + B 0 ≥ 0] , I [W 0 x + B 0 ≥ 0]) 1 + (A 0 , A 0 ) (1x, 1) I (W 0 + W t )x + B 0 + B t ≥ 0 , I (W 0 + W t )x + B 0 + B t ≥ 0 1 ≤ 8c 1 a |H c t | 2 log m + ∆ t np 8c 1 a m 2 log m = 8c 1 a |H c t | + m∆ t np 2 log m (C.16) Combining bounds on I and II, we get ∇ θ L(f t , x) -∇ θ L(g t , x) 1 ≤ 8c 1 a m∆ t np + 2 |H c t | 2 log m + 8c 1 a |H c t | + m∆ t np 2 log m ≤ 8c 1 a 2m∆ t np + 3 |H c t | 2 log m Using Lemma C.1, with at least 1 -1 c1 -exp -64(c2-1) 2 η 2 m 2 ∆2 t 2 π probability, we get Proof. The loss function for pseudo network is ∇ θ L(f t , x) -∇ θ L(g t , x) 1 ≤ 192ηm 1.5 ∆c 1 c 2 a t √ log m √ π + 16c 1 a m∆ t np 2 log m Define Γ as the upper bound on ∇ θ L(f t , x) -∇ θ L(g t , x) 1 . Γ = 192ηm 1.5 ∆c 1 c 2 a t √ log m √ π + 16c 1 a m∆ t L(g t , x) = Q i=1 ∆ x g t (τ i (x)) -log (g t (x)) Dividing the loss function in 2 parts, L(g t , x) = L1 (g t , x) + L2 (g t , x) where L1 (g t , x) = Q i=1 ∆ x g t (τ i (x)) L2 (g t , x) = -log (g t (x)) We will prove convexity of both L1 (g t , x) and L2 (g t , x). To prove convexity of L1 (g t , x) as a function of parameters θ, we will prove that Hessian of L1 (g t , x) is positive semidefinite. 
∇ θ L1 (g t , x) = Q i=1 ∆ x ∇ θ g t (τ i (x)) = Q i=1 ∆ x φ (P t (τ i (x))) ∇ θ P t (τ i (x)) ∇ 2 θ L1 (g t , x) = Q i=1 ∆ x ∇ 2 θ g t (τ i (x)) = Q i=1 ∆ x φ (P t (τ i (x)))∇ θ P t (τ i (x))∇ θ P t (τ i (x)) T + Q i=1 ∆ x φ (P t (τ i (x)))∇ 2 θ P t (τ i (x)) = Q i=1 ∆ x φ (P t (τ i (x)))∇ θ P t (τ i (x))∇ θ P t (τ i (x)) T The first term of the Hessian matrix is sum of Gram matrix and the second term of the Hessian matrix is Gram matrix. Hence, the Hessian of L1 (g t , x) is positive semidefinite. For second term, L2 (g t , x) = -log (exp (P t (x)) I [P t (x) ≤ 0] + (P t (x) + 1) I [P t (x) > 0]) = -P t (x)I [P t (x) ≤ 0] -log (P t (x) + 1) I [P t (x) > 0] Note that L2 (g t , x) is convex in P t (x) and P t (x) is linear in θ. Composition of convex and linear function is convex therefore, L2 (g t , x) is convex in θ. As sum of 2 convex functions is convex, L(g t , x) is convex. Remark D.1. If we use base distribution as standard Gaussian distribution, then loss function will have following term. L1 (g t , x) = Q i=1 ∆ x g t (τ i (x)) 2 If we find Hessian of L1 , then we get ∇ θ L1 (g t , x) = Q i=1 ∆ x g t (τ i (x)) Q i=1 ∆ x φ (P t (τ i (x))) ∇ θ P t (τ i (x)) ∇ 2 θ L1 (g t , x) = Q i=1 ∆ x g t (τ i (x)) Q i=1 ∆ x φ (P t (τ i (x)))∇ θ P t (τ i (x))∇ θ P t (τ i (x)) T + Q i=1 ∆ x ∇ θ g t (τ i (x)) Q i=1 ∆ x ∇ θ g t (τ i (x)) T If base distribution is standard Gaussian distribution, then t has to be negative for some points therefore the first term in the Hessian won't remain positive semi-definite therefore, the loss function won't remain convex in parameters of neural network θ if we use standard Gaussian distribution as base distribution. At points with negative values of g, Hessian of L1 (g t , x) with respect to θ can be negative semidefinie and L1 (g t , x) can be non-convex in θ. Lemma D.2. 
(Approximated loss is close to optimal loss) For every ∈ (0, 1), there exist m > poly U h , 1 , η = Õ 1 m and T = O U 2 h log m 2 such that, with at least 0.95 - exp - 2 m 128c 2 1 U 2 h log m probability, we get 1 T T -1 t=0 E sgd [ L(f t , X )] -L(F * , X ) ≤ O( ) Proof. For set of examples X , define L(f t , X ) = 1 |X | x∈X L(f t , x) From Lemma D.1, we know that L(g t , X ) is convex in parameters θ. Using convexity of L(g t , X ) wrt θ, L(g t , X ) -L(g * , X ) ≤ ∇ θ L(g t , X ), θ t -θ * ≤ ∇ θ L(g t , X ) -∇ θ L(f t , X ) 1 θ t -θ * ∞ + ∇ θ L(f t , X ), θ t -θ * (D.1) where . 1 and . ∞ denotes l 1 and l ∞ norm respectively. The stochastic gradient descent updates the parameters using x t at time t. g * is defined as following. g * (x) = m r=1 a r0 σ (w r0 x + b r0 ) + m r=1 a r0 σ (w r0 x + b r0 ) (w * r x + b * r ) For stochastic gradient descent, we get θ t+1 -θ * 2 2 = θ t -η∇ θ L(f t , x t ) -θ * 2 2 = θ t -θ * 2 2 + η 2 ∇ θ L(f t , x t ) 2 2 -2η θ t -θ * , ∇ θ L(f t , x t ) Taking expectation wrt x t , E x t θ t+1 -θ * 2 2 = θ t -θ * 2 2 + η 2 E x t ∇ θ L(f t , x t ) 2 2 -2η ∇ θ L(f t , X ), θ t -θ * (D.2) Using Eq.(D.2) and Eq.(D.1), L(g t , X ) -L(g * , X ) ≤ ∇ θ L(g t , X ) -∇ θ L(f t , x) 1 θ t -θ * ∞ + θ t -θ * 2 2 -E x t θ t+1 -θ * 2 2 2η + η 2 E x t ∇ θ L(f t , x t ) 2 2 Using Eq.(C.4) and Eq.(C.5), we get ∇ θ L(f t , x t ) 2 2 ≤ 2m ∆2 Averaging from t = 0 to T -1, we get 1 T T -1 t=0 E sgd [ L(g t , X )] -L(g * , X ) ≤ Γ sup t∈[T ] θ t ∞ + θ * ∞ + θ 0 -θ * 2 2 2ηT + ηm ∆2 (i) = Γ sup t∈[T ] θ t ∞ + θ * ∞ + θ * 2 2 2ηT + ηm ∆2 (D.3) Note that ∆ and Γ are defined in Eq.(C.3) and Eq.(C.17). Inequality (i) follows from the fact that θ 0 = (0, 0, . . . , 0) ∈ R 2m . 
Using Lemma B.3 and Lemma C.3, with at least 0.99 -1 c1 - T t=1 exp -64(c2-1) 2 η 2 m 2 ∆2 t 2 π -1 c6 -1 c7 -exp - 2 m 128c 2 1 U 2 h log m probability, we get 1 T T -1 t=0 E sgd [ L(g t , X )] -L(g * , X ) ≤ Γ sup t∈[T ] θ t ∞ + θ * ∞ + θ * 2 2 2ηT + ηm ∆2 1 T T -1 t=0 E sgd [ L(f t , X )] -L(g * , X ) (i) ≤ Γ sup t∈[T ] θ t ∞ + θ * ∞ + θ * 2 2 2ηT + ηm ∆2 + 3∆ t np 1 T T -1 t=0 E sgd [ L(f t , X )] -L(F * , X ) (ii) ≤ Γ sup t∈[T ] θ t ∞ + θ * ∞ + θ * 2 2 2ηT + ηm ∆2 + 3∆ t np + 3 16c 1 (c 6 + c 7 ) a log m + ω φ -1 (F * ) (δ) + Note that inequality (i) and inequality (ii) uses Lemma B.3 and Lemma C.3 respectively. We choose following values/relations of η, T . η = m ∆2 = m 6c 1 a √ 2 log m 2 = 72c 2 1 m 2 a log m T = θ * 2 2 2η = U 2 h π 2m 2 a 72c 2 1 m 2 a log m 2 2 = 18πc 2 1 log m U 2 h 2 (D.4) We can choose δ such that ω φ -1 (F * ) (δ) = . Using above inequalities, we get following equalities. θ * 2 2 2ηT = θ * 2 2 2η 2η θ * 2 2 = ηm ∆2 = m ∆2 m ∆2 = 3 16c 1 (c 6 + c 7 ) a log m + ω φ -1 (F * ) (δ) + ≤ 3 (16c 1 (c 6 + c 7 ) a log m + 2 ) Using Corollary B.1, we get θ * ∞ ≤ U h √ π √ 2m a θ * 2 ≤ √ m θ * ∞ ≤ U h √ π √ 2 √ m a To get value of sup t∈[T ] θ t ∞ = sup t∈[T ] η ∆t = η ∆T = θ * 2 2 ∆ 2 ≤ U 2 h π 2m 2 a 6c 1 a √ 2 log m 2 = 3πU 2 h c 1 √ log m √ 2m a sup t∈[T ] θ t ∞ + θ * ∞ ≤ πU 2 h (1 + 3c 1 ) √ log m √ 2m a Γ = 192ηm 1.5 ∆c 1 c 2 a t √ log m √ π + 16c 1 a m∆ t np 2 log m ≤ 192ηm 1.5 ∆c 1 c 2 a t √ log m √ π + 3072 √ 2c 2 1 c 2 2 a η 2 t 2 m 2.5 log m ∆2 √ π ≤ 192m 1.5 c 1 c 2 a √ log m √ π U 2 h π 4m 2 a 6c 1 a 2 log m + 3072 √ 2c 2 1 c 2 2 a m 2.5 log m √ π U 2 h π 4m 2 a 2 6c 1 a 2 log m 2 ≤ 288 √ 2π √ m log mc 2 1 c 2 U 2 h + 13824 √ 2π 3 c 4 1 c 2 √ m (log m) 2 U 4 h 2 ≤ 14112 √ 2π 3 c 4 1 c 2 √ m (log m) 2 U 4 h 2 Multiplication of Γ and sup t∈[T ] θ t ∞ + θ * ∞ will be Γ sup t∈[T ] θ t ∞ + θ * ∞ ≤

14112 √ 2π 3 c 4 1 c 2 √ m (log m) 2 U 4 h 2 πU 2 h (1 + 3c 1 ) √ log m √ 2m a = 14112π 2.5 c 4 1 c 2 (1 + 3c 1 ) (log m) 2.5 U 6 h √ m a 3 Taking m as m ≥ Ω c 8 1 c 2 2 (1 + 3c 1 ) 2 U 12 h 2 a 8 (D.5) Choosing m which satisfies above inequality will give us the following inequality. Γ sup t∈[T ] θ t ∞ + θ * ∞ ≤ Using Eq.(C.11), we get ∆ t np = 192η 2 m 1.5 ∆2 c 1 c 2 a t 2 √ log m √ π ≤ 192m 1.5 c 1 c 2 a √ log m √ π U 2 h π 4m 2 a 2 6c 1 a 2 log m 2 = 864π 1.5 c 3 1 c 2 (log m) 1.5 U 4 h √ m 2 a (D.6) Using sufficiently high m, we get ∆ t np ≤ 864π 1.5 c 3 1 c 2 (log m) 1.5 U 4 h √ m 2 a ≤ O c 3 1 c 2 (log m) 1.5 U 4 h a 4 2 a c 4 1 c 2 (1 + 3c 1 ) U 6 h = O 2 (log m) 1.5 c 1 (1 + 3c 1 ) U 2 h ≤ O ( ) Using Eq.(D.4) and Eq.(D.5), with at least 0.99- 1 c1 -1 c6 -1 c7 - T t=1 exp -64(c2-1) 2 η 2 m 2 ∆2 t 2 π - exp - 2 m 128c 2 1 U 2 h log m probability, we get 1 T T -1 t=0 E sgd [ L(f t , X )] -L(F * , X ) ≤ Γ sup t∈[T ] θ t ∞ + θ * ∞ + θ * 2 2 2ηT + ηm ∆2 + 2592π 1.5 c 3 1 c 2 (log m) 1.5 U 4 h √ m 2 a + 3 (16c 1 (c 6 + c 7 ) a log m + 2 ) ≤ 3 + 2592π 1.5 c 3 1 c 2 (log m) 1.5 U 4 h √ m 2 a + 3 (16c 1 (c 6 + c 7 ) a log m + 2 ) Taking c 1 = 100, c 2 = 2, c 5 = 1000, c 6 = 100, c 7 = 100, a = 6000 log m ≤ , with at least 0.95 - T t=1 exp -64η 2 m 2 ∆2 t 2 π -exp - 2 m 128c 2 1 U 2 h log m probability, 1 T T -1 t=0 E sgd [ L(f t , X )] -L(F * , X ) ≤ 3 + 2592π 1.5 c 3 1 c 2 (log m) 1.5 U 4 h √ m 2 a + O ( ) ≤ 3 + O 2 U 2 h + O ( ) For any ∈ [0, 1], with probability at least 0.96- T t=1 exp -64η 2 m 2 ∆2 t 2 π -exp - 2 m 128c 2 1 U 2 h log m probability, 1 T T -1 t=0 E sgd [ L(f t , X )] -L(F * , X ) ≤ O( ) To find lower bound on probability, we use T t=1 1 t 2 ≤ ∞ t=1 1 t 2 ≤ 2. T t=1 exp - 64η 2 m 2 ∆2 t 2 π (i) ≤ T t=1 π 64η 2 m 2 ∆2 t 2 ≤ π m ∆2 2 32 2 m 2 ∆2 ≤ π ∆2 32 2 ≤ π 3200 ≤ 0.01 where inequality (i) follows from exp (-x) ≤ 1 x for all x ≥ 0. 
Finally, with at least 0.95exp - 2 m 128c 2 1 U 2 h log m probability, 1 T T -1 t=0 E sgd [ L(f t , X )] -L(F * , X ) ≤ O( )
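Lemma D.1 and Lemma D.2 together follow the standard recipe: the pseudo-network loss is convex in θ, so averaged (stochastic) gradient descent satisfies the usual regret bound ||θ* - θ_0||² / (2ηT) + (η/2T) Σ_t ||g_t||². The following is a self-contained toy check of both steps on a 1-D convex surrogate. The loss below only mirrors the form of the pseudo loss, with φ(p) = e^p for p ≤ 0 and p + 1 for p > 0 and p linear in θ; it is a stand-in, not the actual flow loss, and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)

def phi(p):
    """phi(p) = e^p for p <= 0, p + 1 for p > 0 (convex, increasing)."""
    return np.where(p <= 0, np.exp(np.minimum(p, 0)), p + 1.0)

def loss(theta, X):
    """Convex surrogate: quadrature-like term + negative log-density term,
       with the pseudo-network output p linear in theta."""
    p = X @ theta
    return np.mean(0.5 * phi(p) - np.log(phi(p)))

def grad(theta, X):
    p = X @ theta
    dphi = np.where(p <= 0, np.exp(np.minimum(p, 0)), 1.0)   # phi'(p)
    return X.T @ ((0.5 - 1.0 / phi(p)) * dphi) / X.shape[0]

n, d = 200, 5
X = rng.normal(size=(n, d))

# (1) Midpoint convexity of the loss in theta (Lemma D.1's conclusion).
t1, t2 = rng.normal(size=d), rng.normal(size=d)
mid_gap = loss((t1 + t2) / 2, X) - 0.5 * (loss(t1, X) + loss(t2, X))

# (2) Gradient descent and the averaged regret bound against a comparator.
eta, T = 0.1, 400
theta = np.zeros(d)
ref = rng.normal(size=d)                 # arbitrary comparator point
avg_gap, grad_sq = 0.0, 0.0
for _ in range(T):
    g = grad(theta, X)
    avg_gap += loss(theta, X) - loss(ref, X)
    grad_sq += np.sum(g * g)
    theta -= eta * g
avg_gap /= T
regret_bound = np.sum(ref ** 2) / (2 * eta * T) + eta * grad_sq / (2 * T)
```

The regret inequality holds deterministically for exact gradients of a convex loss, which is exactly the structure the convexity lemma buys in the proof above (with SGD adding only an expectation over the sampled examples).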

E GENERALIZATION

In this section, we prove generalization guarantees to complement our optimization result, and complete the proof of our main theorem (Theorem E.1) about efficiently learning distributions using univariate normalizing flows. The proof in this section can be divided broadly into two parts. First, we prove that the empirical averages of L̄(f_t, x) and L̄(F*, x) over the training examples are close to the expectations of L̄(f_t, x) and L̄(F*, x) with respect to the underlying data distribution. A similar argument is used in Allen-Zhu et al. (2019). Second, we prove that L̄(f_t, x) and L̄(F*, x) are close to L(f_t, x) and L(F*, x), respectively. Recall that the approximate loss function L̄ is given by L̄(f_t, x) = Σ_{i=1}^Q ∆_x φ(N_t(τ_i(x))) - log(φ(N_t(x))), where N_t(x) = Σ_{r=1}^m a_{r0} σ((w_{r0} + w_r^t)x + b_{r0} + b_r^t). Lemma E.1. (Empirical Rademacher complexity for a two-layer neural network) For every B > 0 and every n ≥ 1, with probability at least 1 - 1/c_1 over the random initialization, the empirical Rademacher complexity is bounded by (1/n) E_{ξ∈{±1}^n} sup_{max_{r∈[m]} |w_r|,|b_r| ≤ B} Σ_{i=1}^n ξ_i N(x_i) ≤ 8 c_1 σ_a B m √(2 log m) / √n. Proof. Using part (a) of Lemma H.5, the class {x → w_r x + b_r : |w_r| ≤ B, |b_r| ≤ B} has Rademacher complexity 2B/√n. Using part (b) of Lemma H.5, the class {x → (w_{r0} + w_r)x + (b_{r0} + b_r) : |w_r| ≤ B, |b_r| ≤ B, w_{r0}, b_{r0} ∼ N(0, 1/m)} has Rademacher complexity 2B/√n. Using part (c) of Lemma H.5, the class F = {x → N(x) : max_{r∈[m]} |w_r| ≤ B, max_{r∈[m]} |b_r| ≤ B} has Rademacher complexity R(X; F) ≤ 2 ||A_0||_1 (2B/√n) ≤ 8 c_1 σ_a B m √(2 log m) / √n, where the last inequality follows from Lemma H.2 with probability at least 1 - 1/c_1 over the random initialization.
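The base case in the proof of Lemma E.1, the 2B/√n Rademacher complexity of the bounded affine class {x → wx + b : |w|, |b| ≤ B}, is easy to check by Monte Carlo, since for fixed signs the supremum is attained at w = ±B, b = ±B. A quick sketch (sample sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, B, trials = 100, 1.0, 4000
x = rng.uniform(-1, 1, n)                # data points with |x_i| <= 1

total = 0.0
for _ in range(trials):
    xi = rng.choice([-1.0, 1.0], n)      # Rademacher signs
    # sup over |w|,|b| <= B of (1/n) sum_i xi_i (w x_i + b)
    # equals (B/n) * (|sum_i xi_i x_i| + |sum_i xi_i|)
    total += (B / n) * (abs(np.sum(xi * x)) + abs(np.sum(xi)))
rad_est = total / trials

bound = 2 * B / np.sqrt(n)               # the 2B/sqrt(n) bound from Lemma H.5
```

The estimate lands below 2B/√n, with the slack explained by Jensen's inequality: E|Σ ξ_i x_i| ≤ √(Σ x_i²) ≤ √n, and similarly for the bias term.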
Define upper bound on maximum and lower bound on minimum value of loss function L is given by sup x L (F * , x) = sup x Q i=1 ∆ x F * (τ i (x)) -log(F * (x)) ≤ 2M F * -log (m F * ) := M L (E.1) inf x L (F * , x) = inf x Q i=1 ∆ x F * (τ i (x)) -log(F * (x)) ≥ 2m F * -log (M F * ) := m L (E.2) Lemma E.2. Suppose n is sufficiently high such that it satisfies following condition. n ≥ O M L -m L 2 (Q + 1) 2 U 4 h (log m) 2 2 If n satisfies above condition, then with at least 0.98 probability over random initialization, population loss of any functions of set {x → N (x) | |w r | ≤ η ∆T, |b r | ≤ η ∆T ∀r ∈ [m]} is close to empirical loss i.e. sup N ∈F E x∈D L (f t , x) - 1 n n i=1 L (f t , x) ≤ Proof. Note that L (f t , x) depends on neural network N t (x) through (N t (τ 1 (x)) , N t (τ 2 (x)) , . . . , N t (τ Q (x)) , N t (x)) vector. Using Fact H.8, with at least 1 -δ probability, we get sup N ∈F E x∼D L (f t , x) - 1 n n i=1 L (f t , x) ≤ 2 √ 2L s (Q + 1) R (X ; F) + b log 1 δ 2n (E.3) where F = {x → N (x) | |w r | ≤ η ∆T, |b r | ≤ η ∆T ∀r ∈ [m]}. We get coordinate wise Lipschitz continuity of loss L function as following. L j ≤ sup N ∈F ,|x|≤1 |∆ x φ (N (τ j (x)))| ≤ sup N ∈F ,|x|≤1 1 Q |φ (N (τ j (x)))| ≤ 2 Q ∀i ∈ [Q] L Q+1 ≤ sup N ∈F ,|x|≤1 φ (N (x)) φ(N (x)) = sup N ∈F ,|x|≤1 exp (N (x)) I [N (x) ≤ 0] + I [N t (x) ≥ 0] exp (N (x)) I [N (x) ≤ 0] + (N (x) + 1) I [N (x) ≥ 0] ≤ sup N ∈F ,|x|≤1 I [N (x) ≤ 0] + 1 N (x) + 1 I [N (x) ≥ 0] ≤ 1 Using Lemma H.4, standard Lipschitz constant of L is L s ≤ Q+1 i=1 L 2 i ≤ 4 Q + 1 ≤ 2 (E.4) To get upper bound on L, we use Lipschitz property of L. L (f t , x) -L f t , x ≤ Q i=1 ∆ x |N (τ i (x))| + |N (x)| (E.5) Note that L (f t , x) depends upon (N (τ 1 (x)) , N (τ 2 (x)) , . . . , N (τ Q (x)) , N (x)) vector and similarly, L f t , x depends upon (0, 0, 0, . . . 
, 0, 0) Finding upper bound N (x) for all x ∈ [-1, 1], sup N ∈F ,x∈[-1,1] N (x) ≤ sup |wr|≤η ∆T,|br|≤η ∆T,x∈[-1,1] P t (x) + ∆ T np ≤ sup |wr|≤η ∆T,|br|≤η ∆T,x∈[-1,1] m r=1 a r0 σ (w r0 x + b r0 ) + m r=1 a r0 (w r x + b r ) σ (w r0 x + b r0 ) + ∆ T np (i) ≤ 16c 1 (c 6 + c 7 ) a log m + m 2c 1 a 2 log m 2η ∆T + ∆ T np ≤ 16c 1 (c 6 + c 7 ) a log m + m 2c 1 a 2 log m 2η ∆T + ∆ T np (ii) ≤ 16c 1 (c 6 + c 7 ) a log m + m 48c 2 1 2 a log m U 2 h π 4m 2 a + ∆ T np ≤ O U 2 h log m where inequality (i) uses Eq. (B.9), Lemma H.2 and Eq.(C.6). The inequality (ii) uses our choices of η and T from Eq.(D.4) and lower bound on m from Eq.(D.5). Define K as upper bound on sup N ∈F ,x∈[-1,1] N (x). K := O U 2 h log m (E.6) Using upper bound on sup N ∈F ,x∈[-1,1] N (x) and Eq.(E.5), we get upper bound on L(b). b = K + K + L (0, 0, ..., 0, 0) ≤ 2K + 2 Using value of b in Eq.(E.3) and Lemma E.1, with at least 1 -δ -1 c1 probability, we get sup N ∈F E x∈D L (f t , x) - 1 n n i=1 L (f t , x i ) ≤ 4 √ 2 (Q + 1) 8c 1 a η ∆T m √ 2 log m √ n + (2K + 2) log 1 δ 2n We use δ = 0.01 and choose n which satisfies following condition. n ≥ O M L -m L 2 (Q + 1) 2 U 4 h (log m) 2 2 (E.7) Using above n, with at least 0.98 probability, we get sup N ∈F E x∈D L (f t , x) - 1 n n i=1 L (f t , x i ) ≤ Lemma E.3. (Concentration on approximated loss of target function) Suppose n is sufficiently high such that it satisfies following condition. n ≥ O M L -m L 2 (Q + 1) 2 U 4 h (log m) 2 2 If n satisfies above condition, then with at least 0.9999 probability, population loss of target function F * is close to empirical loss i.e. E x∼D L (F * , x) -L (F * , X ) ≤ Proof. Finding minimum value (m L) and maximum value (M L) of loss function L, sup x L (F * , x) = sup x Q i=1 ∆ x F * (τ i (x)) -log(F * (x)) ≤ 2M F * -log (m F * ) = M L inf x L (F * , x) = inf x Q i=1 ∆ x F * (τ i (x)) -log(F * (x)) ≤ 2m F * -log (M F * ) = m L where M F * = max x∈[-1,1] F * (x) and m F * = max x∈[-1,1] F * (x). 
Using Hoeffding's inequal- ity, Pr E x∼D L (F * , X ) -L (F * , X ) ≥ ≤ exp - 2n 2 M L -m L 2 Taking n as n ≥ O M L -m L 2 (Q + 1) 2 U 4 h (log m) 2 2 With at least probability 1 -exp -2n 2 (ML-mL) 2 , E x∼D L (F * , x) -L (F * , X ) ≤ (E.8) Corollary E.1. Under same setting as Lemma D.2 and n ≥ O M L -m L 2 (Q + 1) 2 U 4 h (log m) 2 2 then with at least 0.92 -2 exp -m 8 2 a -exp - 2 m 128c 2 1 U 2 h log m -exp -2n 2 (ML-mL) 2 probability, we get E sgd 1 T T -1 t=0 E x∼D L(f t , x) -E x∼D L(F * , x) ≤ O( ) Proof. Using Lemma D.2, Lemma E.2 and Lemma E.3, with at least 0.92 -2 exp -m 8 2 a - exp - 2 m 128c 2 1 U 2 h log m -exp -2n 2 (ML-mL) 2 probability, we get  E sgd 1 T T -1 t=0 E x∼D L(f t , x) -E x∼D L(F * , x) ≤ O( ) 2 (Q+1) 2 U 4 h (log m) 2 2 , with at least 0.92-2 exp -m 8 2 a -exp - 2 m 128c 2 1 U 2 h log m - exp -2n 2 (ML-mL) 2 probability, we have E sgd 1 T T -1 t=0 E x∼D [L(f t , x)] -E x∼D [L(F * , x)] ≤ O( ) where U h is the complexity of target function defined in B.1 and M F * = sup x∈[-1,1] F * (x) M F * = sup x∈[-1,1] F * (x) m F * = inf x∈[-1,1] F * (x) M L = 2M F * -log (m F * ) m L = 2m F * -log (M F * ) K 2 = O U 2 h (log m) 1.5 Proof. First, we will try to bound L(F * , x) -L(F * , x) ≤ Q i=1 ∆ x F * (τ i (x)) -F * (x) ≤ 2M F * Q Similarly, bounding error for f t , we will get L(f t , x) -L(f t , x) ≤ Q i=1 ∆ x f t (τ i (x)) -f t (x) ≤ 2 (sup x f t (x)) Q To get sup x f t (x), we will use Eq.(E.6). 
$$\sup_x f_t(x) \le \sup_x |N'_t(x)| \le \sup_x \left| \sum_{r=1}^m a_{r0}\,(w_{r0} + w_r^t)\, \mathbb{I}\!\left[ (w_{r0} + w_r^t)x + b_{r0} + b_r^t \ge 0 \right] \right|$$

$$\le \sup_x \left( \left| \sum_{r \in H} a_{r0}\,(w_{r0} + w_r^t)\, \mathbb{I}[w_{r0}x + b_{r0} \ge 0] \right| + \left| \sum_{r \notin H} a_{r0}\,(w_{r0} + w_r^t)\, \mathbb{I}\!\left[(w_{r0}+w_r^t)x + b_{r0} + b_r^t \ge 0\right] \right| \right)$$

$$\le \sup_x \left( \left| \sum_{r \in H} a_{r0}\, w_{r0}\, \mathbb{I}[w_{r0}x + b_{r0} \ge 0] \right| + \left| \sum_{r \in H} a_{r0}\, w_r^t\, \mathbb{I}[w_{r0}x + b_{r0} \ge 0] \right| + \left| \sum_{r \notin H} a_{r0}\, w_{r0}\, \mathbb{I}\!\left[(w_{r0}+w_r^t)x + b_{r0} + b_r^t \ge 0\right] \right| + \left| \sum_{r \notin H} a_{r0}\, w_r^t\, \mathbb{I}\!\left[(w_{r0}+w_r^t)x + b_{r0} + b_r^t \ge 0\right] \right| \right)$$

$$\stackrel{(i)}{\le} 16 c_1 c_6\, \epsilon_a \log m + m \cdot 2 c_1 \epsilon_a \sqrt{2\log m}\, \eta\Delta T + \frac{4\sqrt{2}\, c_2\, \eta m\sqrt{m}\,\Delta t}{\sqrt{\pi}} \cdot 2 c_1 \epsilon_a \sqrt{2\log m} \cdot \frac{2 c_6 \sqrt{2\log m}}{\sqrt{m}} + \frac{4\sqrt{2}\, c_2\, \eta m\sqrt{m}\,\Delta t}{\sqrt{\pi}} \cdot 2 c_1 \epsilon_a \sqrt{2\log m}\, \eta\Delta T$$

Substituting $\eta\Delta T = \frac{U_h^2\sqrt{\pi}}{4 m^2 \epsilon_a}$,

$$\le O(\epsilon) + O\!\left(U_h^2 \log m\right) + O\!\left(U_h^2 (\log m)^{1.5}\right) + O(\epsilon) \le O\!\left(U_h^2 (\log m)^{1.5}\right)$$

Define $K_2$ as this upper bound on $\sup_x f_t(x)$:

$$K_2 = O\!\left(U_h^2 (\log m)^{1.5}\right)$$

Taking $Q$ as

$$Q \ge \frac{2 M_{F^*} + 2 K_2}{\epsilon} \quad \text{(E.9)}$$

we get

$$\left| L(F^*, x) - \hat{L}(F^*, x) \right| \le \epsilon \quad \text{(E.10)}$$

$$\left| L(f_t, x) - \hat{L}(f_t, x) \right| \le \epsilon \quad \text{(E.11)}$$

Using these relations, we get

$$\mathbb{E}_{sgd}\left[ \frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}_{x \sim \mathcal{D}}[L(f_t, x)] \right] - \mathbb{E}_{x \sim \mathcal{D}}[L(F^*, x)] \le O(\epsilon)$$

By the definition of KL divergence, we get

$$\mathbb{E}_{sgd}\left[ \frac{1}{T}\sum_{t=0}^{T-1} KL\!\left(p_{F^*,Z}\,\|\,p_{f_t,Z}\right) \right] \le O(\epsilon)$$

F PROBLEM IN TRAINING OF CONSTRAINED NORMALIZING FLOW

In this section, we explain why changing the initialization does not fix the problem (described in Section 2.1) in the training of constrained normalizing flows. The neural network in a constrained normalizing flow (CNF) is defined as

$$N(x) = \tau \sum_{r=1}^m a_{r0} \tanh\!\left((w_{r0} + w_r)x + (b_r + b_{r0})\right), \quad \text{with constraints } w_{r0} + w_r \ge \epsilon \text{ for all } r.$$

Here, $\epsilon > 0$ is a small constant and $\tau$ is a normalization constant which depends only on $m$. The pseudo network for this neural network is

$$P(x) = \tau \sum_{r=1}^m a_{r0} \left[ \tanh(w_{r0}x + b_{r0}) + \tanh'(w_{r0}x + b_{r0})\,(w_r x + b_r) \right], \quad \text{with constraints } w_{r0} + w_r \ge \epsilon \text{ for all } r.$$
We decompose $P(x)$ into two parts, $P(x) = P_c(x) + P_\ell(x)$, where

$$P_c(x) = \tau \sum_{r=1}^m a_{r0} \tanh(w_{r0}x + b_{r0}) \quad \text{and} \quad P_\ell(x) = \tau \sum_{r=1}^m a_{r0} \tanh'(w_{r0}x + b_{r0})\,(w_r x + b_r).$$

Note that $P_c(x)$ depends only on the initialization and not on $w_r$ and $b_r$. Hence it cannot approximate the target function after training, so $P_\ell(x)$ needs to approximate the target function with $P_c(x)$ subtracted. The normalization constant $\tau$ is necessary to keep $P(x)$ of the same order as $F^*(x)$ with $P_c(x)$ subtracted.

Now, we show that $P_\ell(x)$ cannot approximate "sufficiently non-linear" functions. Using a half-normal initialization of $w_{r0}$ whose underlying normal distribution has (mean, variance) $= \left(0, \frac{1}{m}\right)$, and a normal initialization of $b_{r0}$ with (mean, variance) $= \left(0, \frac{1}{m}\right)$, $|w_{r0}x + b_{r0}|$ is $O\!\left(\frac{\sqrt{\log m}}{\sqrt{m}}\right)$. Using the fact that $\tanh'(y) \approx 1$ for small $y$, we get $\tanh'(w_{r0}x + b_{r0}) \approx 1$ for sufficiently large $m$. In this case, $P_\ell(x)$ becomes a linear function of $x$ and cannot approximate sufficiently non-linear functions. Using other initialization variances leads to problems in other parts of the proof.

Remark F.1. Using a different variance for the initialization of $w_{r0}$ and $b_{r0}$ will not solve the problem in the training of CNFs (mentioned in Section 2.1).

Informal proof: As described above, if the variance of the initialization of $w_{r0}$ and $b_{r0}$ is $\frac{1}{m}$, then $P_\ell(x)$ cannot approximate a "sufficiently non-linear" target function. The same problem remains as long as this variance is $o\!\left(\frac{1}{\log m}\right)$. For variance between $\Omega\!\left(\frac{1}{\log m}\right)$ and $O(1)$, we will prove by contradiction that there does not exist a pseudo network with sufficiently small norm $\|\theta^*\|_2$ that approximates the target function, and that for larger $\|\theta^*\|_2$, the pseudo network $P(x)$ does not stay close to $N(x)$ during training.

Remark F.2. There does not exist a pseudo network with sufficiently small norm $\|\theta^*\|_2$ that approximates the target function, and for larger $\|\theta^*\|_2$, the pseudo network $P(x)$ does not stay close to $N(x)$ during training.
Informal proof: Suppose there exists a pseudo network $P^*(x)$ which approximates the target function $F^*(x)$. As described earlier, $P^*_c(x)$ depends only on the initialization and not on $w_r$ and $b_r$; hence it cannot approximate the target function after training, so $P^*_\ell(x)$ needs to approximate the target function with $P^*_c(x)$ subtracted. Hence $P^*_c(x)$, $P^*_\ell(x)$ and $F^*(x)$ should all be $\Theta(1)$. From the condition that $P^*_c(x)$ is $O(1)$, we get that $\tau$ must be $o\!\left(\frac{1}{\sigma_{wb}\, m}\right)$ for the considered range of variances of $w_{r0}$ and $b_{r0}$, where $\sigma_{wb}$ is the standard deviation of $w_{r0}$ and $b_{r0}$. We denote the standard deviation of $a_{r0}$ by $\sigma_a$. From the condition that $P^*_\ell(x)$ is $\Theta(1)$, we need

$$\tau \left| \sum_{r=1}^m a_{r0} \tanh'(w_{r0}x + b_{r0})\,(w^*_r x + b^*_r) \right| = \Theta(1) \;\Longrightarrow\; \tau\, m\, \sigma_a \sqrt{\log m}\, \|\theta^*\|_\infty = \Theta(1) \;\Longrightarrow\; \|\theta^*\|_\infty = \Theta\!\left(\frac{1}{\tau m \sigma_a \sqrt{\log m}}\right)$$

Using norm equivalence, we need

$$\|\theta^*\|_2 = \Theta\!\left(\frac{1}{\tau \sqrt{m}\, \sigma_a \sqrt{\log m}}\right)$$

Doing a calculation similar to Lemma D.2 for the loss $L_G$ of the normalizing flow, we get Eq. (D.3), and from Eq. (D.3), to get small training error we need

$$\frac{\|\theta^*\|_2^2}{\eta T} = O(\epsilon) \;\Longrightarrow\; \eta T = O\!\left(\|\theta^*\|_2^2\right) \;\Longrightarrow\; \eta T = O\!\left(\frac{1}{m \tau^2 \sigma_a^2 \log m}\right) \quad \text{(F.1)}$$

Now, to establish coupling between $N(x)$ and $P(x)$, as well as between their derivatives, we upper bound the derivatives of the loss function with respect to $w_r$ and $b_r$:

$$\frac{\partial L_G(f_t, x)}{\partial w_r} = \frac{\partial L_G}{\partial N_t}\, \tau a_{r0}\, \sigma'\!\left((w_{r0}+w_r^t)x + b_{r0} + b_r^t\right) x + \frac{\partial L_G}{\partial N'_t}\, \tau a_{r0} \left( \sigma'\!\left((w_{r0}+w_r^t)x + b_{r0} + b_r^t\right) + (w_{r0}+w_r^t)\, x\, \sigma''\!\left((w_{r0}+w_r^t)x + b_{r0} + b_r^t\right) \right)$$

We assume that $L_G(f_t, x)$ is $\tilde{L}_1$-Lipschitz continuous with respect to $N$ and $\tilde{L}_2$-Lipschitz continuous with respect to $N'$.
Assuming $|\sigma'(\cdot)| \le 1$ and $|x| \le 1$,

$$\left|\frac{\partial L_G(f_t, x)}{\partial w_r}\right| \le \tau \tilde{L}_1 |a_{r0}| + \tau \tilde{L}_2 |a_{r0}| \left( 1 + |w_r^t + w_{r0}|\, \left|\sigma''\!\left((w_{r0}+w_r^t)x + b_{r0} + b_r^t\right)\right| \right)$$

Assuming $|\sigma''(\cdot)| \le C$,

$$\left|\frac{\partial L_G(f_t, x)}{\partial w_r}\right| \le \tau \tilde{L}_1 |a_{r0}| + \tau \tilde{L}_2 |a_{r0}| \left( 1 + C\,(|w_r^t| + |w_{r0}|) \right)$$

Using Lemma H.2 for $a_{r0}$ and $w_{r0}$, with probability at least $1 - \frac{1}{c_1} - \frac{1}{c_2}$,

$$\left|\frac{\partial L_G(f_t, x)}{\partial w_r}\right| \le 2 c_1 \sigma_a \tau \sqrt{2\log m} \left( \tilde{L}_1 + \tilde{L}_2 \left(1 + C\left(|w_r^t| + 2 c_2 \sigma_{wb} \sqrt{2\log m}\right)\right) \right) \quad \text{(F.2)}$$

For projected gradient descent,

$$|w_r^t| \le \eta \sum_{i=0}^{t-1} \left|\frac{\partial L_G(f_i, x_i)}{\partial w_r}\right| \le \eta \sum_{i=0}^{t-1} \left[ 2 c_1 \sigma_a \tau \sqrt{2\log m}\left( \tilde{L}_1 + \tilde{L}_2 + 2 c_2 C \tilde{L}_2 \sigma_{wb} \sqrt{2\log m} \right) + 2 c_1 \sigma_a \tau \sqrt{2\log m}\, \tilde{L}_2 C\, |w_r^i| \right]$$

$$\le 2 \eta c_1 \sigma_a \tau \sqrt{2\log m}\left( \tilde{L}_1 + \tilde{L}_2 + 2 c_2 C \tilde{L}_2 \sigma_{wb} \sqrt{2\log m} \right) t + 2 \eta c_1 \sigma_a \tau \sqrt{2\log m}\, \tilde{L}_2 C \sum_{i=0}^{t-1} |w_r^i|$$

Define $\alpha$ and $\beta$ as

$$\alpha = 2 \eta c_1 \tau \sigma_a \sqrt{2\log m}\left( \tilde{L}_1 + \tilde{L}_2 + 2 c_2 C \tilde{L}_2 \sigma_{wb} \sqrt{2\log m} \right), \qquad \beta = 2 \eta c_1 \sigma_a \tau \sqrt{2\log m}\, \tilde{L}_2 C$$

Using $\alpha$ and $\beta$,

$$|w_r^t| \le \alpha t + \beta \sum_{i=0}^{t-1} |w_r^i|$$

where

$$\sum_{i=0}^{t-1} |w_r^i| \le \alpha(t-1) + (1+\beta)\sum_{i=0}^{t-2} |w_r^i| \le \alpha\left[(t-1) + (1+\beta)(t-2)\right] + (1+\beta)^2 \sum_{i=0}^{t-3} |w_r^i| \le \alpha\left[(t-1) + (1+\beta)(t-2) + (1+\beta)^2(t-3)\right] + (1+\beta)^3 \sum_{i=0}^{t-4} |w_r^i|$$

In general, for $0 \le t' \le t-1$ we can write

$$\sum_{i=0}^{t-1} |w_r^i| \le \alpha \sum_{i=1}^{t-t'-1} (1+\beta)^{i-1}(t-i) + (1+\beta)^{t-t'-1} \sum_{i=0}^{t'} |w_r^i|$$

Taking $t' = 0$,

$$\sum_{i=0}^{t-1} |w_r^i| \le \alpha \sum_{i=1}^{t-1} (1+\beta)^{i-1}(t-i)$$

The sum $\sum_{i=1}^{t-1} (1+\beta)^{i-1}(t-i)$ is the sum of an arithmetic–geometric progression (AGP). Using Fact H.6,

$$\sum_{i=0}^{t-1} |w_r^i| \le \alpha \sum_{i=1}^{t-1} (1+\beta)^{i-1}(t-i) = \alpha\, \frac{\beta(1+\beta)^{t-1} - \beta(t-1) - (1+\beta) + (1+\beta)^{t-1}}{\beta^2} = \alpha\, \frac{(1+\beta)^t - (1+\beta t)}{\beta^2} \quad \text{(F.3)}$$

Using Eq. (F.3) to bound $|w_r^t|$,

$$|w_r^t| \le \alpha t + \beta\, \alpha\, \frac{(1+\beta)^t - (1+\beta t)}{\beta^2} = \alpha\, \frac{(1+\beta)^t - 1}{\beta} = \frac{\tilde{L}_1 + \tilde{L}_2 + 2 c_2 C \tilde{L}_2 \sigma_{wb} \sqrt{2\log m}}{\tilde{L}_2 C} \left( \left(1 + 2 \eta c_1 \sigma_a \tau \sqrt{2\log m}\, \tilde{L}_2 C\right)^t - 1 \right) =: \Delta_w^t$$

Similarly,

$$\frac{\partial L_G(f_t, x)}{\partial b_r} = \frac{\partial L_G}{\partial N_t}\, \tau a_{r0}\, \sigma'\!\left((w_{r0}+w_r^t)x + b_{r0} + b_r^t\right) + \frac{\partial L_G}{\partial N'_t}\, \tau a_{r0}\, (w_{r0}+w_r^t)\, \sigma''\!\left((w_{r0}+w_r^t)x + b_{r0} + b_r^t\right)$$

We assume that $L_G(f_t, x)$ is $\tilde{L}_1$-Lipschitz with respect to $N_t(x)$ and $\tilde{L}_2$-Lipschitz with respect to $N'_t(x)$. Additionally, assuming $|\sigma'(\cdot)| \le 1$ and $|\sigma''(\cdot)| \le C$:
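The unrolled recursion above can be checked numerically: iterating $u_t = \alpha t + \beta \sum_{i<t} u_i$ with $u_0 = 0$ (the worst case of the recursive bound) reproduces the closed form $\alpha\left((1+\beta)^t - 1\right)/\beta$ exactly. A small sketch with arbitrary illustrative values of $\alpha$ and $\beta$:

```python
def worst_case(alpha, beta, T):
    """Iterate u_t = alpha*t + beta*sum(u_0..u_{t-1}) with u_0 = 0."""
    u = [0.0]
    for t in range(1, T + 1):
        u.append(alpha * t + beta * sum(u[:t]))
    return u[T]

def closed_form(alpha, beta, T):
    """Closed form of the worst case: alpha*((1+beta)^T - 1)/beta."""
    return alpha * ((1 + beta) ** T - 1) / beta

alpha, beta = 0.5, 0.25
direct = worst_case(alpha, beta, 10)
bound = closed_form(alpha, beta, 10)
```

The agreement is exact (not just an upper bound) because every step of the unrolling is tight when the recursion holds with equality.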
$$\left|\frac{\partial L_G(f_t, x)}{\partial b_r}\right| \le \tilde{L}_1 |a_{r0}|\, \tau + \tilde{L}_2 |a_{r0}|\, C \tau \left( |w_{r0}| + |w_r^t| \right)$$

Using Lemma H.2 for $a_{r0}$ and $w_{r0}$, with probability at least $1 - \frac{1}{c_1} - \frac{1}{c_2}$,

$$\left|\frac{\partial L_G(f_t, x)}{\partial b_r}\right| \le 2 c_1 \sigma_a \sqrt{2\log m}\, \tau \left( \tilde{L}_1 + \tilde{L}_2 C\left( |w_r^t| + 2 c_2 \sigma_{wb} \sqrt{2\log m} \right) \right) \quad \text{(F.4)}$$

For projected gradient descent,

$$|b_r^t| \le \eta \sum_{i=0}^{t-1} \left|\frac{\partial L_G(f_i, x_i)}{\partial b_r}\right| \le 2 \eta c_1 \sigma_a \tau \sqrt{2\log m}\left( \tilde{L}_1 + 2 c_2 \tilde{L}_2 C \sigma_{wb} \sqrt{2\log m} \right) t + 2 \eta c_1 \sigma_a \tau \tilde{L}_2 C \sqrt{2\log m} \sum_{i=0}^{t-1} |w_r^i|$$

Using Eq. (F.3),

$$|b_r^t| \le 2 \eta c_1 \sigma_a \tau \sqrt{2\log m}\left( \tilde{L}_1 + 2 c_2 \tilde{L}_2 C \sigma_{wb} \sqrt{2\log m} \right) t + \frac{\tilde{L}_1 + \tilde{L}_2 + 2 c_2 C \tilde{L}_2 \sigma_{wb} \sqrt{2\log m}}{\tilde{L}_2 C} \left( \left(1 + 2 \eta c_1 \sigma_a \tau \sqrt{2\log m}\, \tilde{L}_2 C\right)^t - 1 \right) =: \Delta_b^t$$

Finding a lower bound on $\Delta_b^t$, and using $(1+x)^t \ge 1 + tx$, we get

$$\Delta_b^t \ge 2 \eta c_1 \sigma_a \tau \sqrt{2\log m}\left( \tilde{L}_1 + 2 c_2 \tilde{L}_2 C \sigma_{wb} \sqrt{2\log m} \right) t + \frac{\tilde{L}_1 + \tilde{L}_2 + 2 c_2 C \tilde{L}_2 \sigma_{wb} \sqrt{2\log m}}{\tilde{L}_2 C} \cdot 2 \eta t c_1 \sigma_a \tau \sqrt{2\log m}\, \tilde{L}_2 C = \Omega\!\left( \eta t\, \sigma_a \tau \sigma_{wb} \log m \right)$$

We get a similar lower bound on $\Delta_w^t$:

$$\Delta_w^t = \Omega\!\left( \eta t\, \sigma_a \tau \sigma_{wb} \log m \right)$$

Now, we show coupling between $N_t(x)$ and $P_t(x)$. Assuming that $\sigma'$ is $C$-Lipschitz continuous,

$$|N'_t(x) - P'_t(x)| = \tau \left| \sum_{r=1}^m a_{r0} \left[ (w_r^t + w_{r0}) \left( \sigma'\!\left((w_{r0}+w_r^t)x + b_{r0}+b_r^t\right) - \sigma'(w_{r0}x + b_{r0}) \right) - \sigma''(w_{r0}x + b_{r0})\, w_{r0}\, (w_r^t x + b_r^t) \right] \right|$$

$$\le C\tau \sum_{r=1}^m 2|a_{r0}|\,|w_r^t + w_{r0}| + C\tau \sum_{r=1}^m |a_{r0}|\,|w_{r0}|\,|w_r^t x + b_r^t| = C\tau \sum_{r=1}^m |a_{r0}| \left( 2|w_r^t + w_{r0}| + |w_{r0}|\,|w_r^t x + b_r^t| \right)$$

$$\stackrel{(i)}{\le} C\tau \left( 2 c_1 \sigma_a \sqrt{2\log m} \left( \|W^t\|_1 + \|W_0\|_1 \right) + 2 c_2 \sigma_{wb} \sqrt{2\log m} \left( \|W^t\|_1 + \|B^t\|_1 \right) \right) = O\!\left( \tau \sigma_a \sigma_{wb}\, m \log m\, (\Delta_w^t + \Delta_b^t) \right)$$

where inequality (i) follows from Fact H.2.
Now, using $t = T$ to find an upper bound on $|N_T(x) - P_T(x)|$, we get

$$|N_T(x) - P_T(x)| = O\!\left( \tau \sigma_a \sigma_{wb}\, m \log m\, (\Delta_w^T + \Delta_b^T) \right) = O\!\left( \tau \sigma_a \sigma_{wb}\, m \log m \cdot \eta T \sigma_a \tau \sigma_{wb} \log m \right) = O\!\left( \eta T\, \tau^2 \sigma_a^2 \sigma_{wb}^2\, m (\log m)^2 \right)$$

Substituting $\eta T = O\!\left(\frac{1}{m \tau^2 \sigma_a^2 \log m}\right)$ from Eq. (F.1),

$$= O\!\left( \frac{\tau^2 \sigma_a^2 \sigma_{wb}^2\, m (\log m)^2}{m \tau^2 \sigma_a^2 \log m} \right) = O\!\left( \sigma_{wb}^2 \log m \right)$$

For $\sigma_{wb}^2$ in the range $\Omega\!\left(\frac{1}{\log m}\right)$ to $O(1)$, $|N_T(x) - P_T(x)|$ can become very large, and in those cases $N_T(x)$ and $P_T(x)$ will not remain close.

G ADDITIONAL EXPERIMENTS

In this section, we show experimental results on synthetic 1D data to support our theoretical findings. We use the same architecture and initialization as described in Sec. 2 for both constrained and unconstrained normalizing flows. In all our experiments, we fix the weights of the output layer and train the weights and biases of the hidden layer. For training, we use mini-batch SGD with batch size 32. We use 2 datasets, each with 10,000 data points: one is a mixture of 2 Gaussians and the other is a mixture of 3 beta distributions. All results are averaged over 3 independent runs.
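A minimal sketch of how such a synthetic 1D dataset can be generated (the mixture weights, means, and scales below are placeholders for illustration, not the parameters used in our experiments):

```python
import random

def sample_gaussian_mixture(n, seed=0):
    """Draw n points from a 1D mixture of 2 Gaussians (illustrative parameters)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        # Equal-weight mixture of N(-2, 0.5^2) and N(2, 0.5^2).
        if rng.random() < 0.5:
            data.append(rng.gauss(-2.0, 0.5))
        else:
            data.append(rng.gauss(2.0, 0.5))
    return data

data = sample_gaussian_mixture(10_000)
```

The beta-mixture dataset can be produced analogously with `random.betavariate` and three components.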

G.1 RESULTS FOR UNCONSTRAINED NORMALIZING FLOW

In Fig. 2, we compare the data distribution with the distribution of generated data for unconstrained normalizing flow. The unconstrained normalizing flow with an exponential base distribution learns the data distribution well. We study the effect of overparameterization on the L2-norms of W_t and B_t and on the convergence speed. To reproduce a situation similar to the theoretical analysis for unconstrained normalizing flows, we choose the learning rate as c/m, where c is a constant. The first row of Fig. 3 contains results for the mixture of Gaussians dataset and the second row contains results for the mixture of beta distributions dataset. From Fig. 3, we see that the L2-norms of W_t and B_t decrease with increasing m; moreover, the change is proportional to 1/√m, which matches the bound in our theoretical result. From the last column of Fig. 3, we see that the training speed remains almost constant across different values of m. Our choice of T in the theoretical analysis also depends only poly-logarithmically on m. We obtained similar results with a Gaussian base distribution as well.

G.2 RESULTS FOR CONSTRAINED NORMALIZING FLOW

In Sec. 2.1, we suggested that high overparameterization may adversely affect training for constrained normalizing flows. We now give experimental evidence for this in Figs. 4, 5 and 6. We use a Gaussian base distribution for all our constrained normalizing flow experiments. In Fig. 4, we see that for neural networks with m = 100 and m = 400, the training loss decreases stably. In Figs. 5 and 6, we see that as we increase the learning rate, training becomes more stable for larger m. Note that for learning rates 0.025 and 0.0125, the constrained normalizing flow with m = 1600 does not learn anything because the learning rate is too small. We observe that the L2-norms of W_t and B_t for m = 6400 are at least as large as those for m = 1600.
On both datasets, as we increase the learning rate, the L2-norm of B_t increases (except for learning rate 0.05 on the mixture of beta distributions) and learning of the constrained normalizing flow becomes more and more stable. We also experimented with more epochs (1000) on the mixture of Gaussians dataset for numbers of hidden units m = 1600, 6400 (see Fig. 7). These observations support our claim in Sec. 2.1 that for learning and approximation with overparameterized constrained normalizing flows, the neural networks need large L2-norms of W_t and B_t.

H USEFUL FACTS

Lemma H.1. Suppose $Z_k \sim N(0, \sigma^2)$ for $k = 1, \ldots, n$ are independent. Then $Y = \sum_{k=1}^n Z_k^2$ is a (scaled) chi-squared random variable with the following property for all $t \in (0,1)$:

$$\Pr\left( \left| \frac{1}{n} \sum_{k=1}^n Z_k^2 - \sigma^2 \right| \ge t \right) \le 2 \exp\left( -\frac{n t^2}{8 \sigma^4} \right)$$

Proof. From Example 2.11 of Wainwright (2019), for $Z_k \sim N(0,1)$, $Y = \sum_{k=1}^n Z_k^2$ is a chi-squared random variable satisfying, for all $t \in (0,1)$,

$$\Pr\left( \left| \frac{1}{n} \sum_{k=1}^n Z_k^2 - 1 \right| \ge t \right) \le 2 \exp\left( -\frac{n t^2}{8} \right)$$

Applying this to $Z_k/\sigma$ gives the required result.

Lemma H.2. Let $X_1, X_2, \ldots, X_n$ be independent random variables from $N(0, \sigma^2)$. Then, with probability at least $1 - \frac{1}{c}$, $\max_i |X_i| \le 2 c\, \sigma \sqrt{2 \log n}$. Using $\tilde{R} = R\sigma$, we get the required result.

For an arithmetic–geometric progression (AGP) with initial term $a$, common difference $d$ and common ratio $r$, the sum of the first $n$ terms is given by

$$S_n = \frac{a - \left[a + (n-1)d\right] r^n}{1 - r} + \frac{d r \left(1 - r^{n-1}\right)}{(1-r)^2}$$

Fact H.7. (Lemma A.3 from Ji et al. (2020)) The Fourier transform of $f$ is defined as

$$\hat{f}(w) = \int f(x)\, e^{2\pi i w x}\, dx$$

The polar decomposition of the Fourier transform is $\hat{f}(w) = |\hat{f}(w)|\, e^{2\pi i \theta_f(w)}$ with $|\theta_f(w)| \le 1$. The Fourier transform satisfies the following property: 1. $|\hat{f}(w)| \le \|f\|_{L_1}$ for any real number $w$.
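The concentration bound in Lemma H.1 can be sanity-checked by a small seeded simulation; the values of $n$, $\sigma$, $t$ and the trial count below are arbitrary illustrative choices:

```python
import math
import random

def deviation_frequency(n, sigma, t, trials, seed=0):
    """Fraction of trials where |mean of Z_k^2 - sigma^2| >= t, with Z_k ~ N(0, sigma^2)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mean_sq = sum(rng.gauss(0.0, sigma) ** 2 for _ in range(n)) / n
        if abs(mean_sq - sigma ** 2) >= t:
            hits += 1
    return hits / trials

n, sigma, t = 200, 1.0, 0.5
freq = deviation_frequency(n, sigma, t, trials=500)
bound = 2 * math.exp(-n * t ** 2 / (8 * sigma ** 4))
```

Empirically the deviation frequency sits well below the theoretical tail bound, as expected from a sub-exponential concentration inequality.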



Related Work. Previous work on normalizing flows has studied different variants such as planar and radial flows in Rezende & Mohamed (2015), Sylvester flows in van den Berg et al. (2018), Householder flows in Tomczak & Welling (2016), and masked autoregressive flows in Papamakarios et al.

Figure 1: Effect of over-parameterization on training of constrained normalizing flow on the mixture of Gaussians dataset for numbers of hidden units m = 1600, 6400

(One-dimensional version of Theorem 4.3 from Ji et al. (2020)) Let ψ : R → R and δ > 0 be given, and define

Using the technique from Yehudai & Shamir (2019), we define

$$h\left( (a_{10}, w_{10}, b_{10}), \ldots, (a_{r0}, w_{r0}, b_{r0}), \ldots, (a_{m0}, w_{m0}, b_{m0}) \right) = \sup_{x \in [-1,1]} \left| P^*(x) - \mathbb{E}_{a_{r0}, w_{r0}, b_{r0}}\left[ P^*(x) \right] \right|$$

We use McDiarmid's inequality to bound $h$. Replacing the $r$-th coordinate $(a_{r0}, w_{r0}, b_{r0})$ by an independent copy $(a'_{r0}, w'_{r0}, b'_{r0})$ changes $h$ by at most

$$\left| h\left( \ldots, (a_{r0}, w_{r0}, b_{r0}), \ldots \right) - h\left( \ldots, (a'_{r0}, w'_{r0}, b'_{r0}), \ldots \right) \right| \le \frac{4 c_1 U_h \sqrt{2 \log m}}{m}$$

Using Lemma 26.2 from Shalev-Shwartz & Ben-David (2014), we get

Using the Lipschitz continuity of the function $\varphi$, we get

$$|\varphi(P^*(x)) - \varphi(\tilde{P}^*(x))| \le |P^*(x) - \tilde{P}^*(x)| \le \left| \sum_{r=1}^m a_{r0} (w_{r0} x + b_{r0})\, \mathbb{I}[w_{r0} x + b_{r0} \ge 0] \right|$$

Now, there are at most $m$ break points of the indicators $\mathbb{I}[w_{r0} x + b_{r0} \ge 0]$, i.e. points where the value of some indicator changes. We can therefore divide the range of $x$ into at most $m+1$ subintervals such that, within each subinterval, the value of $\mathbb{I}[w_{r0} x + b_{r0} \ge 0]$ is fixed for all $r$. Suppose there are $\tilde{m}$ indicators with value 1 in a given subinterval. Without loss of generality, we can assume these are the indicators for $r = 1$ to $r = \tilde{m}$. Then

$$\sum_{r=1}^m a_{r0} (w_{r0} x + b_{r0})\, \mathbb{I}[w_{r0} x + b_{r0} \ge 0] = \sum_{r=1}^{\tilde{m}} a_{r0} (w_{r0} x + b_{r0})$$

Now, applying Hoeffding's inequality to the sum in the above equation, we get

that gradient-based optimization of the loss can be closely approximated by gradient-based optimization of the pseudo network. Since the loss function of the pseudo network is convex in its parameters, we get global optimization. Lemma D.1. (Convexity of loss function of pseudo network) The loss function for the pseudo network is convex with respect to the parameters of the neural network.
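The convexity in Lemma D.1 comes from the pseudo network being affine in the trainable parameters $(w_r, b_r)$ (the activation pattern is frozen at initialization), and composing a convex loss with an affine map preserves convexity. A minimal numerical check of this affinity, using ReLU and random illustrative initialization values:

```python
import random

random.seed(0)
m = 16
# Frozen initialization (a_r0, w_r0, b_r0); only (w_r, b_r) are trained.
a0 = [random.gauss(0, 1) for _ in range(m)]
w0 = [random.gauss(0, 1) for _ in range(m)]
b0 = [random.gauss(0, 1) for _ in range(m)]

def relu(z):
    return max(z, 0.0)

def drelu(z):
    return 1.0 if z >= 0 else 0.0

def pseudo(x, w, b):
    """P(x) = sum_r a_r0 [ relu(w_r0 x + b_r0) + relu'(w_r0 x + b_r0) (w_r x + b_r) ]."""
    return sum(
        a0[r] * (relu(w0[r] * x + b0[r]) + drelu(w0[r] * x + b0[r]) * (w[r] * x + b[r]))
        for r in range(m)
    )

# Affinity: P(x; midpoint of two parameter settings) equals the midpoint of the outputs.
w1 = [random.gauss(0, 1) for _ in range(m)]
b1 = [random.gauss(0, 1) for _ in range(m)]
w2 = [random.gauss(0, 1) for _ in range(m)]
b2 = [random.gauss(0, 1) for _ in range(m)]
x = 0.3
mid = pseudo(x, [(u + v) / 2 for u, v in zip(w1, w2)], [(u + v) / 2 for u, v in zip(b1, b2)])
avg = (pseudo(x, w1, b1) + pseudo(x, w2, b2)) / 2
```

Since the indicator factors depend only on the frozen initialization, equality holds exactly, which is what makes the pseudo-network training objective convex.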

(loss function is close to optimal) For every $\epsilon \in (0,1)$, there exist $m > \mathrm{poly}\!\left(U_h, \frac{1}{\epsilon}\right)$ and $\eta = \tilde{O}(\cdot)$ such that, for a target function $F^*$ with finite second-order derivative, number of quadrature points $Q \ge \frac{4 M_{F^*} + 4 K_2}{\epsilon}$ and number of training points $n \ge O\!\left((M_L - m_L)^2 \cdots \right)$

$$P_c(x) = \tau \sum_{r=1}^m a_{r0} \tanh(w_{r0} x + b_{r0}) \quad \text{and} \quad P_\ell(x) = \tau \sum_{r=1}^m a_{r0} \tanh'(w_{r0} x + b_{r0})\, (w_r x + b_r).$$

For variance $\sigma_{wb}^2$ between $\Omega\!\left(\frac{1}{\log m}\right)$ and $O(1)$, $|N_T(x) - P_T(x)|$ can become very large, and in those cases $N_T(x)$ and $P_T(x)$ will not remain close.

Figure 2: Comparison of data distribution and generated data for mixture of Gaussian and beta distributions

Figure 5: Effect of over-parameterization on training of constrained normalizing flow on the mixture of Gaussians dataset for numbers of hidden units m = 1600, 6400

Suppose a function $f : \mathbb{R}^d \to \mathbb{R}$ is $L_g$-Lipschitz continuous and $L_i$-coordinate-wise Lipschitz continuous, i.e.

$$|f(a) - f(b)| \le L_g \|a - b\| \quad \forall a, b \in \mathbb{R}^d \quad \text{(standard Lipschitz continuity)}$$

$$|f(a_1, \ldots, a_i, \ldots, a_d) - f(a_1, \ldots, b_i, \ldots, a_d)| \le L_i |a_i - b_i| \quad \forall a_1, \ldots, a_d, b_i \in \mathbb{R},\ \forall i \in [d] \quad \text{(coordinate-wise Lipschitz continuity)}$$

If a function $f$ satisfies $L_i$-coordinate-wise Lipschitz continuity for all $i$, then it satisfies

$$|f(a_1, \ldots, a_d) - f(b_1, \ldots, b_d)| \le \sum_{i=1}^{d} L_i |a_i - b_i|$$

Moreover, $f$ then also satisfies standard Lipschitz continuity, with the constant $L_g$ related to the $L_i$ as follows. Define $a = (a_1, \ldots, a_d)$ and $b = (b_1, \ldots, b_d)$. Then

$$|f(a_1, \ldots, a_d) - f(b_1, \ldots, b_d)| \le |f(a_1, a_2, \ldots, a_d) - f(b_1, a_2, \ldots, a_d)| + |f(b_1, a_2, a_3, \ldots, a_d) - f(b_1, b_2, a_3, \ldots, a_d)| + \cdots + |f(b_1, \ldots, b_{d-1}, a_d) - f(b_1, \ldots, b_d)| \le L_1 |a_1 - b_1| + L_2 |a_2 - b_2| + \cdots + L_d |a_d - b_d|$$

and the bound $L_1|a_1 - b_1| + \cdots + L_d|a_d - b_d| \le \sqrt{\sum_i L_i^2}\, \|a - b\|$ follows from the Cauchy–Schwarz inequality.

Fact H.1. (Hoeffding's inequality for a binomial random variable) Let $X$ be a binomial random variable with parameters $n$ (number of trials) and $p$ (probability of success). For a number of successes $k \ge np$, the following inequality holds:

$$\Pr(X \ge k) \le \exp\left( -2n \left( \frac{k}{n} - p \right)^2 \right)$$

Fact H.2. (Half-normal distribution) If $X$ follows a normal distribution with mean 0 and variance $\sigma^2$, $N(0, \sigma^2)$, then $Y = |X| = X\,\mathrm{sign}(X)$ follows a half-normal distribution with mean

$$\mathbb{E}[Y] = \frac{\sigma\sqrt{2}}{\sqrt{\pi}}$$

Fact H.3. For a Gaussian random variable $X \sim N(0, \sigma^2)$ and all $t \in (0, \sigma)$, we have

$$\Pr(|X| \ge t) \ge 1 - \frac{4t}{5\sigma}$$

Fact H.4. The sum of the reciprocals of the squares of the natural numbers is $\sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6}$.

Fact H.5. (Theorem 3.1($r_5$) of Li & Yeh (2013)) For any $\alpha > 1$ and $x \in (0, 1]$: $x^{\alpha-1} \le \dfrac{1}{\alpha - (\alpha - 1)x}$.

Fact H.6.
An arithmetic–geometric progression (AGP) has the form $a,\ (a+d)r,\ (a+2d)r^2,\ (a+3d)r^3,\ \ldots,\ [a+(n-1)d]\, r^{n-1}$,
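The closed-form AGP sum of Fact H.6, $S_n = \frac{a - [a+(n-1)d]r^n}{1-r} + \frac{dr(1-r^{n-1})}{(1-r)^2}$, can be checked against direct term-by-term summation; the parameter values below are arbitrary:

```python
def agp_sum_direct(a, d, r, n):
    """Sum of a, (a+d)r, (a+2d)r^2, ..., [a+(n-1)d]r^(n-1), term by term."""
    return sum((a + k * d) * r ** k for k in range(n))

def agp_sum_closed(a, d, r, n):
    """Closed form: S_n = (a - [a+(n-1)d] r^n)/(1-r) + d r (1 - r^(n-1))/(1-r)^2."""
    return (a - (a + (n - 1) * d) * r ** n) / (1 - r) + d * r * (1 - r ** (n - 1)) / (1 - r) ** 2

s1 = agp_sum_direct(1.0, 2.0, 0.5, 5)
s2 = agp_sum_closed(1.0, 2.0, 0.5, 5)
```

With $a=1$, $d=2$, $r=0.5$, $n=5$ both evaluate to $1 + 1.5 + 1.25 + 0.875 + 0.5625 = 5.1875$.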

Let $\alpha > 0$ be given and define $\beta := \frac{1}{2\pi\alpha}$. Let $G_\alpha$ be the Gaussian with coordinate-wise variance $\alpha^2$. Then $\hat{G}_\alpha = G_\beta$ (meaning $\hat{G}_\alpha$ has no radial component) and

Let $\mathcal{F}$ be a set of functions $\mathbb{R}^d \to \mathbb{R}$ and $\mathcal{X} = (x_1, x_2, \ldots, x_n)$ a finite set of samples. The empirical Rademacher complexity of $\mathcal{F}$ with respect to $\mathcal{X}$ is defined by

$$\mathcal{R}(\mathcal{X}; \mathcal{F}) = \mathbb{E}_{\xi \sim \{\pm 1\}^n} \left[ \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \xi_i f(x_i) \right]$$

The Rademacher complexity has the following properties.
a. Suppose $|x| \le 1$ for all $x \in \mathcal{X}$. The class $\mathcal{F} = \{x \mapsto wx + b \;|\; |w| \le B, |b| \le B\}$ has Rademacher complexity $\mathcal{R}(\mathcal{X}, \mathcal{F}) \le \frac{2B}{\sqrt{n}}$.
b. Given classes of functions $\mathcal{F}_1, \mathcal{F}_2$, we have $\mathcal{R}(\mathcal{X}; \mathcal{F}_1 + \mathcal{F}_2) = \mathcal{R}(\mathcal{X}; \mathcal{F}_1) + \mathcal{R}(\mathcal{X}; \mathcal{F}_2)$.
c. Given classes $\mathcal{F}_1, \mathcal{F}_2, \ldots, \mathcal{F}_m$ of functions $\mathcal{X} \to \mathbb{R}$ and a fixed vector $w \in \mathbb{R}^m$, the class $\mathcal{F} = \{x \mapsto \sum_{r=1}^m w_r\, \sigma(f_r(x)) \;|\; f_r \in \mathcal{F}_r\}$ satisfies $\mathcal{R}(\mathcal{X}; \mathcal{F}) \le 2\|w\|_1 \max_{r \in [m]} \mathcal{R}(\mathcal{X}; \mathcal{F}_r)$, where $\sigma$ is a 1-Lipschitz continuous function.

Proof. Parts b and c of the proposition are from Allen-Zhu et al. (2019). For part a, using the independence of $\xi_i$ for all $i \in [n]$, we get the stated bound.

Fact H.8. (Rademacher complexity) If $\mathcal{F}_1, \mathcal{F}_2, \ldots, \mathcal{F}_k$ are $k$ classes of functions $\mathbb{R}^d \to \mathbb{R}$ and $L_x : \mathbb{R}^k \to [-b, b]$ is an $L_g$-Lipschitz continuous function for any $x \sim \mathcal{D}$, then with probability at least $1 - \delta$,

$$\sup_{f_1 \in \mathcal{F}_1, \ldots, f_k \in \mathcal{F}_k} \left| \mathbb{E}_{x \sim \mathcal{D}}\left[ L_x(f_1(x), \ldots, f_k(x)) \right] - \frac{1}{n} \sum_{i=1}^n L_x(f_1(x_i), \ldots, f_k(x_i)) \right| \le 2\, \mathcal{R}(\mathcal{X}; \mathcal{L}) + b \sqrt{\frac{\log\frac{1}{\delta}}{2n}}$$
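Part (a) above can be sanity-checked by Monte Carlo: for the class $\{x \mapsto wx + b : |w| \le B, |b| \le B\}$, the supremum inside the Rademacher expectation equals $B\left(\left|\sum_i \xi_i x_i\right| + \left|\sum_i \xi_i\right|\right)/n$, so the complexity can be estimated directly. A small seeded sketch (the sample size and trial count are illustrative):

```python
import math
import random

def rademacher_affine(xs, B, trials, seed=0):
    """Monte Carlo estimate of E_xi sup_{|w|,|b|<=B} (1/n) sum_i xi_i (w x_i + b)."""
    rng = random.Random(seed)
    n = len(xs)
    total = 0.0
    for _ in range(trials):
        xi = [rng.choice((-1, 1)) for _ in range(n)]
        # The sup over |w| <= B and |b| <= B separates into two absolute values.
        total += B * (abs(sum(e * x for e, x in zip(xi, xs))) + abs(sum(xi))) / n
    return total / trials

random.seed(1)
xs = [random.uniform(-1, 1) for _ in range(400)]
B = 1.0
est = rademacher_affine(xs, B, trials=300)
bound = 2 * B / math.sqrt(len(xs))
```

The estimate stays below $2B/\sqrt{n}$ with a visible margin, since both $\mathbb{E}|\sum_i \xi_i|$ and $\mathbb{E}|\sum_i \xi_i x_i|$ are of order $\sqrt{2n/\pi}$ rather than $\sqrt{n}$.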

$(w_1, w_2, \ldots, w_m) \in \mathbb{R}^m$ and $B = (b_1, b_2, \ldots, b_m) \in \mathbb{R}^m$ of the neural network. We use Stochastic Gradient Descent (SGD) to update the parameters of the neural networks. Denote by $\theta_t = (W_t, B_t)$ with $W_t = (w_1^t,$

where $|H_t|$ denotes the cardinality of the set $H_t$, and similarly for $|H_t^c|$. Lemma C.2. (Bound on the difference of $f$ and $g$) For every $x$ with $|x| \le 1$ and for every time step $t \ge 1$, with probability at least $1 - \frac{1}{c_1}$, the function computed with the neural network and the function computed with the pseudo network are close, for some positive constant $c_1 > 1$.

$\frac{1}{m}\big)$ and a normal distribution initialization of $b_{r0}$ with (mean, variance) $= \left(0, \frac{1}{m}\right)$, $|w_{r0}|$ and $|b_{r0}|$ are $O\!\left(\frac{\sqrt{\log m}}{\sqrt{m}}\right)$

$\frac{1}{c_1}$ probability, the following holds. Assuming $n \ge 2$, the last inequality follows. Using Markov's inequality,

Lemma H.3. For a Gaussian random variable $X \sim N(0, \sigma^2)$, the following anti-concentration inequality holds:


where $\mathcal{L}$ is the set of all functions $L_x$. Using the vector contraction inequality from Maurer (2016), we get

