LEARNING AND GENERALIZATION IN UNIVARIATE OVERPARAMETERIZED NORMALIZING FLOWS

Abstract

In supervised learning, it is known that overparameterized neural networks with one hidden layer provably and efficiently learn and generalize when trained using Stochastic Gradient Descent (SGD). In contrast, the benefit of overparameterization in unsupervised learning is not well understood. Normalizing flows (NFs) learn to map complex real-world distributions into simple base distributions and constitute an important class of models in unsupervised learning for sampling and density estimation. In this paper, we theoretically and empirically analyze these models when the underlying neural network is an overparameterized network with one hidden layer. On the one hand, we provide evidence that for a class of NFs, overparameterization hurts training. On the other hand, we prove that another class of NFs, with similar underlying networks, can efficiently learn any reasonable data distribution under minimal assumptions. We extend theoretical ideas on learning and generalization from overparameterized neural networks in supervised learning to overparameterized normalizing flows in unsupervised learning. We also provide experimental validation to support our theoretical analysis in practice.

1. INTRODUCTION

Neural network models trained using simple first-order iterative algorithms have been very effective in both supervised and unsupervised learning. Theoretical explanation of this phenomenon requires one to consider simple but quintessential formulations, where it can be demonstrated by mathematical proof, along with experimental evidence for the underlying intuition. First, minimization of the training loss is typically a non-smooth and non-convex optimization problem over the parameters of the neural network, so it is surprising that neural networks can be trained efficiently by first-order iterative algorithms. Second, even large neural networks with more parameters than training samples often generalize well, achieving small loss on unseen test data instead of overfitting the seen training data. Recent work in supervised learning attempts to provide theoretical justification for why overparameterized neural networks can train and generalize efficiently in the above sense. In supervised learning, empirical risk minimization with the quadratic loss is a non-convex optimization problem even for a fully connected neural network with one hidden layer of neurons with ReLU activations. Around 2018, it was realized that when the hidden layer size is large compared to the dataset size, or compared to some measure of complexity of the data, one can provably show efficient training and generalization for these networks, e.g., Jacot et al. (2018); Li & Liang (2018); Du et al. (2018); Allen-Zhu et al. (2019); Arora et al. (2019). Of these, Allen-Zhu et al. (2019) is directly relevant to our paper and will be discussed later.

The role of overparameterization, and provable training and generalization guarantees for neural networks, are less well understood in unsupervised learning. Generative modeling, i.e., learning a data distribution from given samples, is an important problem in unsupervised learning. Popular generative models based on neural networks include Generative Adversarial Networks (GANs) (e.g., Goodfellow et al. (2014)), Variational AutoEncoders (VAEs) (e.g., Kingma & Welling (2014)), and Normalizing Flows (e.g., Rezende & Mohamed (2015)). GANs and VAEs have shown an impressive capability to generate samples of photo-realistic images, but they cannot give probability density estimates for new data points, and their training faces various additional challenges such as mode collapse, posterior collapse, vanishing gradients, and training instability, as shown in, e.g., Bowman et al. (2016); Salimans et al. (2016); Arora et al. (2018); Lucic et al. (2018). In contrast to generative models such as GANs and VAEs, normalizing flows, once they learn a distribution, can perform both sampling and density estimation, leading to wide-ranging applications as mentioned in the surveys by Kobyzev et al. (2020) and Papamakarios et al. (2019).
Theoretical understanding of learning and generalization in normalizing flows (and, more generally, in generative models and unsupervised learning) is a natural and important open question, and our main technical contribution is to extend known techniques from supervised learning to make progress towards answering it. In this paper, we study learning and generalization in the case of univariate overparameterized normalizing flows. The restriction to the univariate case is technically non-trivial and interesting in its own right: univariate ReLU networks have been studied in the recent supervised learning literature (e.g., Savarese et al. (2019), Williams et al. (2019), Sahs et al. (2020) and Daubechies et al. (2019)). Multidimensional flows are qualitatively more complex, and our 1D analysis sheds some light on them (see Sec. 4). Before stating our contributions, we briefly introduce normalizing flows; details appear in Section 2.

Normalizing Flows. We work with one-dimensional probability distributions with continuous density. The general idea behind normalizing flows (NFs), restricted to 1D, can be summarized as follows. Let X ∈ R be a random variable denoting the data distribution. We also fix a base distribution with associated random variable Z, which is typically standard Gaussian, though in this paper we will work with the exponential distribution as well. Given i.i.d. samples of X, the goal is to learn a continuous, strictly monotone increasing map f_X : R → R that transports the distribution of X to the distribution of Z: in other words, the distribution of f_X^{-1}(Z) is that of X. The map f_X is learned by representing it by a neural network and setting up an appropriate loss function. The monotonicity requirement, which makes f_X invertible, while not essential, greatly simplifies the problem and is present in all the works we are aware of; it is not clear how to set up a tractable optimization problem without this requirement.
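As a concrete illustration, the transport-map view above can be sketched in a few lines of NumPy/SciPy. The map f_X below is a hypothetical stand-in for a learned flow (any strictly increasing function would do); density estimation uses the standard 1D change-of-variables formula p_X(x) = p_Z(f_X(x)) · f_X'(x), and sampling inverts the monotone map by root finding:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

# Hypothetical monotone transport map f_X: an illustrative strictly
# increasing function, not a trained network from the paper.
def f_X(x):
    return x + 0.5 * np.tanh(x)

def f_X_prime(x):
    return 1.0 + 0.5 / np.cosh(x) ** 2

# Density estimation via the 1D change-of-variables formula,
# with Z standard Gaussian as the base distribution.
def density(x):
    return norm.pdf(f_X(x)) * f_X_prime(x)

# Sampling: draw z ~ N(0, 1) and solve f_X(x) = z for x; since f_X is
# strictly increasing, the root is unique and bracketing search succeeds.
def sample(rng, n):
    zs = rng.standard_normal(n)
    return np.array([brentq(lambda x, z=z: f_X(x) - z, -50.0, 50.0)
                     for z in zs])
```

Because f_X' > 0 everywhere, the density is positive and integrates to one: substituting u = f_X(x) turns the integral of p_Z(f_X(x)) f_X'(x) into the integral of p_Z(u).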
Since the functions represented by standard neural networks are not necessarily monotone, the design of the neural network is altered to make it monotone. For our 1D situation, one-hidden-layer networks are of the form N(x) = ∑_{i=1}^m a_i σ(w_i x + b_i), where m is the size of the hidden layer and the a_i, w_i, b_i are the parameters of the network. We will assume that the activation functions used are monotone. We distinguish between two such alterations: (1) Changing the parametrization of the neural network. This can be done in multiple ways: instead of a_i, w_i we use a_i^2, w_i^2 (or other functions of a_i, w_i that take on only positive values, such as the exponential function) (Huang et al., 2018; Cao et al., 2019). This approach appears to be the most popular. In this paper, we also suggest another related alteration: we simply restrict the parameters a_i, w_i to be positive, by enforcing this constraint during training. (2) Instead of using N(x) for f(x), we use φ(N(x)) for the derivative f'(x) = df/dx, where φ : R → R^+ takes on only positive values. Positivity of f' implies monotonicity of f. Note that no restrictions on the parameters are required; however, because we parametrize f', the function f needs to be reconstructed using numerical quadrature. This approach is used by Wehenkel & Louppe (2019). We will refer to the models in the first class as constrained normalizing flows (CNFs) and those in the second class as unconstrained normalizing flows (UNFs).

Our Contributions. In this paper, we study both constrained and unconstrained univariate NFs, theoretically as well as empirically. The existing analyses for overparametrized neural networks in the supervised setting work with a linear approximation of the neural network, termed a pseudo network in Allen-Zhu et al. (2019).
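The two alterations can be sketched as follows, assuming ReLU as the monotone activation σ and exp as the positive function φ; both are illustrative choices, not necessarily those of the cited works:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 16                          # hidden layer width
a, w, b = rng.standard_normal((3, m))

def relu(t):
    return np.maximum(t, 0.0)

# Alteration (1), CNF style: squaring a_i and w_i yields non-negative
# coefficients, so N is monotone non-decreasing in x.
def cnf(x):
    return np.sum(a**2 * relu(w**2 * x + b))

# Alteration (2), UNF style: the network parametrizes the derivative
# f'(x) = phi(N(x)) with phi = exp > 0; f itself is recovered by
# numerical quadrature of f' (trapezoidal rule here).
def unf_derivative(x):
    return np.exp(np.sum(a * relu(w * x + b)))

def unf(x, x0=0.0, n_pts=200):
    ts = np.linspace(x0, x, n_pts)
    ys = np.array([unf_derivative(t) for t in ts])
    return np.sum(0.5 * (ys[1:] + ys[:-1]) * np.diff(ts))
```

The CNF evaluates f directly but constrains its parameters; the UNF leaves parameters unconstrained at the price of a quadrature per evaluation of f.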
They show that (1) there is a pseudo network with weights close to the initial ones that approximates the target function, and (2) the loss surfaces of the neural network and the pseudo network are close, and moreover the latter is convex for convex loss functions. This allows for a proof that the training of the neural network converges to a global optimum. One can try to adapt this approach of using a linear approximation of the neural network to analyze the training of NFs. However, one immediately encounters new roadblocks: the loss surface of the pseudo network is non-convex for both CNFs and UNFs. In both cases, we identify novel variations that make the optimization problem for the associated pseudo network convex. For CNFs, instead of using a_i^2, w_i^2 as parameters, we simply impose the constraints a_i ≥ ε and w_i ≥ ε for some small constant ε > 0. The optimization algorithm now is projected SGD, which in this case incurs essentially no extra cost over SGD due to the simplicity of the positivity constraints. Apart from making the optimization problem convex, in experiments this variation
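The projection step for this variation is just a coordinate-wise clamp, since the constraint set {a_i ≥ ε, w_i ≥ ε} is a box and Euclidean projection onto a box acts independently on each coordinate. A minimal sketch, with an illustrative value of ε:

```python
import numpy as np

EPS = 1e-3  # illustrative lower bound on a_i and w_i

def project(params, eps=EPS):
    # Exact Euclidean projection onto {a_i >= eps, w_i >= eps}:
    # coordinate-wise clamping; the biases b_i stay unconstrained.
    a, w, b = params
    return np.maximum(a, eps), np.maximum(w, eps), b

def projected_sgd_step(params, grads, lr, eps=EPS):
    # Ordinary SGD step followed by the (cheap) projection.
    stepped = tuple(p - lr * g for p, g in zip(params, grads))
    return project(stepped, eps)
```

This is why the projection incurs essentially no extra cost over SGD: it is an O(m) elementwise maximum per step.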





