FORWARD SUPER-RESOLUTION: HOW CAN GANS LEARN HIERARCHICAL GENERATIVE MODELS FOR REAL-WORLD DISTRIBUTIONS

Abstract

Generative adversarial networks (GANs) are among the most successful models for learning high-complexity, real-world distributions. In theory, however, due to the highly non-convex, non-concave landscape of the min-max training objective, GANs remain among the least understood deep learning models. In this work, we formally study how GANs can efficiently learn certain hierarchically generated distributions that are close to the distribution of real-life images. We prove that when a distribution has a structure that we refer to as forward super-resolution, then simply training generative adversarial networks using stochastic gradient descent ascent (SGDA) can learn this distribution efficiently, in both sample and time complexity. We also provide empirical evidence that our "forward super-resolution" assumption is very natural in practice, and that the underlying learning mechanisms we study in this paper (which allow us to efficiently train GANs via SGDA in theory) simulate the actual learning process of GANs on real-world problems.¹

1. INTRODUCTION

Generative adversarial networks (GANs) (Goodfellow et al., 2014) are among the most successful models for learning high-complexity, real-world distributions. In practice, by training a min-max objective with respect to a generator and a discriminator consisting of multi-layer neural networks, using simple local search algorithms such as stochastic gradient descent ascent (SGDA), the generator can be trained efficiently to generate samples from complicated distributions (such as the distribution of images). From a theoretical perspective, however, how can GANs learn these distributions efficiently, given that learning much simpler ones is already computationally hard (Chen et al., 2022a)? Answering this in full can be challenging. However, following the tradition of learning theory, one may hope to discover some concept class consisting of non-trivial target distributions, and to show that by using SGDA on a min-max generator-discriminator objective, not only does the training converge in poly-time (a.k.a. trainability), but more importantly, the generator learns the target distribution to good accuracy (a.k.a. learnability). To this extent, we believe prior theory works studying GANs may still be somewhat inadequate.

• Some existing theories focus on properties of GANs at the global optimum (Arora et al., 2017; 2018; Bai et al., 2018; Unterthiner et al., 2017); it remains unclear how the training process can find such a global optimum efficiently.

• Some theories focus on the trainability of GANs, in the case when the loss function is convex-concave (so a global optimum can be reached), or when the goal is only to find a critical point (Daskalakis & Panageas, 2018a; b; Gidel et al., 2018; Heusel et al., 2017; Liang & Stokes, 2018; Lin et al., 2019; Mescheder et al., 2017; Mokhtari et al., 2019; Nagarajan & Kolter, 2017). Due to the non-linear neural networks used in practical GANs, it is highly unlikely that the min-max training objective is convex-concave. Also, it is unclear whether such critical points correspond to learning certain non-trivial distributions (like image distributions).

• Even if the generator and the discriminator are linear functions over prescribed feature mappings, such as the neural tangent kernel (NTK) feature mappings (see Allen-Zhu et al., 2019b; Arora et al., 2019; Daniely et al., 2016; Du et al., 2018; Jacot et al., 2018; Zou et al., 2018, and the references therein), the training objective can still be non-convex-concave.

• Some other works introduce notions such as proximal equilibria (Farnia & Ozdaglar, 2020) or add a gradient penalty (Mescheder et al., 2018) to improve training convergence. Once again, they do not study the "learnability" aspect of GANs. In particular, Chen et al. (2022b) explicitly argue that min-max optimality may not directly imply distributional learning for GANs.

• Even worse, unlike supervised learning, where some non-convex learning problems can be shown to have no bad local minima (Ge et al., 2016), to the best of our knowledge it remains unclear what the quality of those critical points in GANs is, except in the simplest setting where the generator is a one-layer neural network (Feizi et al., 2017; Lei et al., 2019).

(We discuss some other related works on distributional learning in the full version.)
Motivated by this huge gap between theory and practice, in this work we make a preliminary step by showing that, when an image-like distribution is hierarchically generated (using an unknown O(1)-layer target generator) with a structural property that we refer to as forward super-resolution, then under certain mild regularity conditions, such a distribution can be efficiently learned, in both sample and time complexity, by applying SGDA on a GAN objective.² Moreover, to justify the scope of our theorem, we provide empirical evidence that forward super-resolution holds for practical image distributions, and that most of our regularity conditions hold in practice as well. We believe our work extends the scope of traditional distribution learning theory to the regime of learning continuous, complicated real-world distributions such as the distribution of images, which are often generated through hierarchical generative models. We draw connections between traditional distribution learning techniques, such as the method of moments, and the generator-discriminator framework in GANs, and shed light on what GANs are doing beyond these techniques.

1.1. FORWARD SUPER-RESOLUTION: A SPECIAL PROPERTY OF IMAGES

Real images can be viewed at multiple resolutions without losing their semantics. In other words, the resolution of an image can be greatly reduced (e.g., by averaging nearby pixels) while still keeping the structure of the image. Motivated by this observation, the seminal work of Karras et al. (2018) proposes to train a generator progressively: the lower levels of the generator are trained first to generate the lower-resolution versions of images, and then the higher levels are gradually trained to generate images of higher and higher resolution. In our work, we formulate this property of images as what we call forward super-resolution.

Forward super-resolution property (for the mathematical statement, see Section 2.1): There exists a generator G, an L-hidden-layer neural network with ReLU activation, where G_ℓ represents the hidden neuron values at layer ℓ, and there exist matrices W_ℓ such that the distribution of images at resolution level ℓ is given by W_ℓ G_ℓ, where the randomness is taken over the input to G (usually standard Gaussian).

In plain words, we assume there is an (unknown) neural network G whose hidden layer G_ℓ can be used to generate images of resolution level ℓ (larger ℓ means better resolution) via a linear transformation, typically a deconvolution. We illustrate that this assumption holds in practical GAN training in Figure 1. This assumption is also made in the practical work of Karras et al. (2018). Moreover, there is a body of works that directly use GANs or deconvolution networks for super-resolution (Bulat & Tzimiropoulos, 2018; Ledig et al., 2017; Lim et al., 2017; Wang et al., 2018; Zhang et al., 2018).
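To make the structural assumption concrete, the following is a minimal NumPy sketch of a generator with the forward super-resolution property. It is not the paper's construction: the layer widths, resolution levels, and random weights below are illustrative choices only. It shows the shape of the assumption: one ReLU network G shared across levels, with a separate linear map W_ℓ reading out a level-ℓ image from the hidden layer G_ℓ.

```python
import numpy as np

rng = np.random.default_rng(0)

L = 3                       # number of hidden layers (resolution levels)
widths = [16, 32, 64, 128]  # widths[0] = input dim, widths[1..L] = hidden widths
res = [4, 8, 16]            # level ell produces a res[ell-1] x res[ell-1] image

# Hidden-layer weights A_ell and per-level output maps W_ell. Here they are
# random; in the paper both are unknown parameters of the target generator.
A = [rng.standard_normal((widths[l + 1], widths[l])) / np.sqrt(widths[l])
     for l in range(L)]
W = [rng.standard_normal((res[l] ** 2, widths[l + 1])) / np.sqrt(widths[l + 1])
     for l in range(L)]

def sample_images(n):
    """Draw n Gaussian inputs and return the image at every resolution level."""
    z = rng.standard_normal((n, widths[0]))  # standard Gaussian input to G
    h = z
    images = []
    for l in range(L):
        h = np.maximum(h @ A[l].T, 0.0)      # hidden layer G_ell (ReLU)
        images.append(h @ W[l].T)            # level-ell image: W_ell G_ell
    return images

for l, x in enumerate(sample_images(5), start=1):
    print(f"level {l}: {x.shape[0]} images with {x.shape[1]} pixels")
```

Note that every level's image is a linear function of the corresponding hidden layer, so training lower levels first (as in progressive growing) only requires the lower layers of G.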

2. PROBLEM SETUP

Throughout this paper, we use a = poly(b) for a > 0, b > 1 to denote that there are absolute constants C_1 > C_2 > 0 such that b^{C_2} < a < b^{C_1}. For a target learning error ε ∈ [1/d^{ω(1)}, 1/poly(d)],



¹ Full version of this paper can be found on https://arxiv.org/abs/2106.02619.
² Plus a simple SVD warmup initialization that is easily computable from the covariance of image patches.
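For intuition, here is one way such an SVD warmup could be computed. This is only an illustrative sketch: the footnote says the initialization is computable from the covariance of image patches, but the patch size, rank k, and the choice of top eigenvectors as initial output directions below are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)

def patch_covariance(images, patch=4):
    """Collect all non-overlapping patch x patch patches from a batch of
    images (shape n x H x W) and return their empirical covariance,
    a (patch^2 x patch^2) matrix."""
    n, H, W = images.shape
    ps = []
    for i in range(0, H - patch + 1, patch):
        for j in range(0, W - patch + 1, patch):
            ps.append(images[:, i:i + patch, j:j + patch].reshape(n, -1))
    P = np.concatenate(ps, axis=0)
    P = P - P.mean(axis=0, keepdims=True)
    return P.T @ P / P.shape[0]

def svd_warmup(images, patch=4, k=8):
    """Top-k eigenvectors of the patch covariance (via SVD), usable as a
    warmup initialization for the linear/deconvolution output map."""
    C = patch_covariance(images, patch)
    U, S, _ = np.linalg.svd(C)  # C is symmetric PSD, so U holds eigenvectors
    return U[:, :k], S[:k]

images = rng.standard_normal((64, 16, 16))  # stand-in "image" data
U0, S0 = svd_warmup(images, patch=4, k=8)
print(U0.shape)  # (16, 8): 8 orthonormal directions over 4x4 patches
```

The appeal of such a warmup is that it requires only second moments of the data, which connects it to the method-of-moments perspective discussed in the introduction.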

