LEARNING AND GENERALIZATION IN UNIVARIATE OVERPARAMETERIZED NORMALIZING FLOWS

Abstract

In supervised learning, it is known that overparameterized neural networks with one hidden layer provably and efficiently learn and generalize when trained using Stochastic Gradient Descent (SGD). In contrast, the benefit of overparameterization in unsupervised learning is not well understood. Normalizing flows (NFs) learn to map complex real-world distributions into simple base distributions, and constitute an important class of models in unsupervised learning for sampling and density estimation. In this paper, we theoretically and empirically analyze these models when the underlying neural network is a one-hidden-layer overparameterized network. On the one hand, we provide evidence that for a class of NFs, overparameterization hurts training. On the other hand, we prove that another class of NFs, with similar underlying networks, can efficiently learn any reasonable data distribution under minimal assumptions. We extend theoretical ideas on learning and generalization from overparameterized neural networks in supervised learning to overparameterized normalizing flows in unsupervised learning. We also provide experimental validation to support our theoretical analysis in practice.

1. INTRODUCTION

Neural network models trained using simple first-order iterative algorithms have been very effective in both supervised and unsupervised learning. A theoretical understanding of this phenomenon requires one to consider simple but quintessential formulations in which it can be demonstrated by mathematical proof, along with experimental evidence for the underlying intuition. First, minimizing the training loss is typically a non-smooth, non-convex optimization over the parameters of the neural network, so it is surprising that neural networks can be trained efficiently by first-order iterative algorithms. Second, even large neural networks whose number of parameters exceeds the size of the training data often generalize well, achieving small loss on unseen test data instead of overfitting the training data. Recent work in supervised learning attempts to explain why overparameterized neural networks can train and generalize efficiently in the above sense. In supervised learning, empirical risk minimization with the quadratic loss is a non-convex optimization problem even for a fully connected neural network with one hidden layer of ReLU neurons. Around 2018, it was realized that when the hidden layer is large compared to the dataset size, or compared to some measure of the complexity of the data, one can prove efficient training and generalization for these networks, e.g., Jacot et al. (2018); Li & Liang (2018); Du et al. (2018); Allen-Zhu et al. (2019); Arora et al. (2019). Of these, Allen-Zhu et al. (2019) is directly relevant to our paper and will be discussed later.

The role of overparameterization, and provable training and generalization guarantees for neural networks, are less well understood in unsupervised learning. Learning a data distribution from given samples, i.e., generative modeling, is an important problem in unsupervised learning. Popular generative models based on neural networks include Generative Adversarial Networks (GANs) (e.g., Goodfellow et al. (2014)), Variational AutoEncoders (VAEs) (e.g., Kingma & Welling (2014)), and Normalizing Flows (e.g., Rezende & Mohamed (2015)). GANs and VAEs have shown an impressive capability to generate photo-realistic images, but they cannot provide probability density estimates for new data points. Moreover, training GANs and VAEs poses various additional challenges such as mode collapse, posterior collapse, vanishing gradients, and training instability, as shown in, e.g., Bowman et al. (2016); Salimans et al. (2016); Arora et al. (2018); Lucic et al. (2018).
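To make the normalizing-flow setting concrete, the sketch below illustrates the univariate change-of-variables principle that underlies density estimation with NFs: a strictly increasing map f sends data x to a standard normal base variable z = f(x), and the model density is p_X(x) = p_Z(f(x)) · f'(x). The particular parameterization here (an affine term plus a one-hidden-layer network with positive outer weights and tanh activations, which keeps f monotone and hence invertible) and all parameter values are our own illustrative assumptions, not the construction analyzed in this paper.

```python
import numpy as np

# Illustrative univariate normalizing flow (not the paper's exact model).
# f is strictly increasing because a > 0 and the outer weights w are
# positive, so f is invertible and the change-of-variables formula applies.
rng = np.random.default_rng(0)
m = 16                           # hidden-layer width (illustrative)
b = rng.normal(size=m)           # hidden-unit biases (illustrative)
w = np.abs(rng.normal(size=m))   # positive outer weights -> monotone f
a = 1.0                          # positive slope of the affine part

def f(x):
    """Strictly increasing flow map f: R -> R."""
    return a * x + np.sum(w * np.tanh(x - b))

def log_density(x):
    """log p_X(x) = log p_Z(f(x)) + log f'(x), standard normal base p_Z."""
    z = f(x)
    dfdx = a + np.sum(w / np.cosh(x - b) ** 2)   # f'(x) > 0 by construction
    log_base = -0.5 * z ** 2 - 0.5 * np.log(2 * np.pi)
    return log_base + np.log(dfdx)

print(log_density(0.3))
```

Training such a model would maximize the log-likelihood above over the samples; the two classes of flows studied in this paper differ in how the monotone map f is parameterized by the underlying network.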

