FORWARD SUPER-RESOLUTION: HOW CAN GANS LEARN HIERARCHICAL GENERATIVE MODELS FOR REAL-WORLD DISTRIBUTIONS

Abstract

Generative adversarial networks (GANs) are among the most successful models for learning high-complexity, real-world distributions. However, in theory, due to the highly non-convex, non-concave landscape of the min-max training objective, GANs remain one of the least understood deep learning models. In this work, we formally study how GANs can efficiently learn certain hierarchically generated distributions that are close to the distribution of real-life images. We prove that when a distribution has a structure that we refer to as forward super-resolution, then simply training generative adversarial networks using stochastic gradient descent ascent (SGDA) can learn this distribution efficiently, both in sample and time complexities. We also provide empirical evidence that our assumption "forward super-resolution" is very natural in practice, and that the underlying learning mechanisms that we study in this paper (which allow us to efficiently train GANs via SGDA in theory) simulate the actual learning process of GANs on real-world problems. (footnote 1)

1. INTRODUCTION

Generative adversarial networks (GANs) (Goodfellow et al., 2014) are among the most successful models for learning high-complexity, real-world distributions. In practice, by training a min-max objective with respect to a generator and a discriminator consisting of multi-layer neural networks, using simple local search algorithms such as stochastic gradient descent ascent (SGDA), the generator can be trained efficiently to generate samples from complicated distributions (such as the distribution of images). But, from a theoretical perspective, how can GANs learn these distributions efficiently, given that learning much simpler ones is already computationally hard (Chen et al., 2022a)?
Answering this in full can be challenging. However, following the tradition of learning theory, one may hope to discover some concept class consisting of non-trivial target distributions, and to show that, using SGDA on a min-max generator-discriminator objective, not only does the training converge in polynomial time (a.k.a. trainability), but more importantly, the generator learns the target distribution to good accuracy (a.k.a. learnability). To this extent, we believe prior theory works studying GANs may still be somewhat inadequate.
• Some existing theories focus on properties of GANs at the global optimum (Arora et al., 2017; 2018; Bai et al., 2018; Unterthiner et al., 2017), while it remains unclear how the training process can find such a global optimum efficiently.
• Some theories focus on the trainability of GANs, in the case when the loss function is convex-concave (so a global optimum can be reached), or when the goal is only to find a critical point (Daskalakis & Panageas, 2018a; b; Gidel et al., 2018; Heusel et al., 2017; Liang & Stokes, 2018; Lin et al., 2019; Mescheder et al., 2017; Mokhtari et al., 2019; Nagarajan & Kolter, 2017). Due to the non-linear neural networks used in practical GANs, it is highly unlikely that the min-max training objective is convex-concave. Also, it is unclear whether such critical points correspond to learning certain non-trivial distributions (like image distributions).
• Even if the generator and the discriminator are linear functions over prescribed feature mappings, such as the neural tangent kernel (NTK) feature mappings (see Allen-Zhu et al., 2019b; Arora et al., 2019; Daniely et al., 2016; Du et al., 2018; Jacot et al., 2018; Zou et al., 2018, and the references therein), the training objective can still be non-convex-concave.
• Some other works introduced notions such as proximal equilibria (Farnia & Ozdaglar, 2020) or added a gradient penalty (Mescheder et al., 2018) to improve training convergence. Once again, they do not study the "learnability" aspect of GANs. In particular, Chen et al. (2022b) even explicitly argue that min-max optimality may not directly imply distributional learning for GANs.
• Even worse, unlike supervised learning, where some non-convex learning problems can be shown to have no bad local minima (Ge et al., 2016), to the best of our knowledge it remains unclear what the qualities of those critical points in GANs are, except in the simplest setting when the generator is a one-layer neural network (Feizi et al., 2017; Lei et al., 2019).
(We discuss some other related works in distributional learning in the full version.)
Motivated by this huge gap between theory and practice, in this work we make a preliminary step by showing that, when an image-like distribution is hierarchically generated (using an unknown O(1)-layered target generator) with a structural property that we refer to as forward super-resolution, then under certain mild regularity conditions, such a distribution can be efficiently learned, both in sample and time complexity, by applying SGDA on a GAN objective (footnote 2). Moreover, to justify the scope of our theorem, we provide empirical evidence that forward super-resolution holds for practical image distributions, and that most of our regularity conditions hold in practice as well.
We believe our work extends the scope of traditional distribution learning theory to the regime of learning continuous, complicated real-world distributions such as the distribution of images, which are often generated through some hierarchical generative models. We draw connections between traditional distribution learning techniques, such as the method of moments, and the generator-discriminator framework in GANs, and shed light on what GANs are doing beyond these techniques.

1.1. FORWARD SUPER-RESOLUTION: A SPECIAL PROPERTY OF IMAGES

Real images can be viewed in multiple resolutions without losing the semantics. In other words, the resolution of an image can be greatly reduced (e.g., by taking the average of nearby pixels) while still keeping the structure of the image. Motivated by this observation, the seminal work of Karras et al. (2018) proposes to train a generator progressively: the lower levels of the generator are trained first to generate the lower-resolution version of images, and then the higher levels are gradually trained to generate higher and higher resolution images. In our work, we formulate this property of images as what we call forward super-resolution.
Forward super-resolution property (for the mathematical statement, see Section 2.1): There exists a generator $G$, an $L$-hidden-layer neural network with ReLU activation, where each $G_\ell$ represents the hidden neuron values at layer $\ell$, and there exist matrices $W_\ell$ such that the distribution of images at resolution level $\ell$ is given by $W_\ell G_\ell$, where the randomness is taken over the input to $G$ (usually a standard Gaussian).
In plain words, we assume there is an (unknown) neural network $G$ whose hidden layer $G_\ell$ can be used to generate images of resolution level $\ell$ (larger $\ell$ means better resolution) via a linear transformation, typically a deconvolution. We illustrate that this assumption holds in practical GAN training in Figure 1. This assumption is also made in the practical work of Karras et al. (2018). Moreover, there is a body of work that directly uses GANs or deconvolution networks for super-resolution (Bulat & Tzimiropoulos, 2018; Ledig et al., 2017; Lim et al., 2017; Wang et al., 2018; Zhang et al., 2018).
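As a concrete illustration of the "average of nearby pixels" resolution reduction mentioned above, the following is a minimal sketch using NumPy. The function names `average_pool` and `resolution_pyramid` and the factor-2 downsampling are assumptions made for illustration; the paper does not prescribe a specific reduction.

```python
import numpy as np

def average_pool(image: np.ndarray, factor: int) -> np.ndarray:
    """Reduce resolution by averaging non-overlapping factor x factor blocks."""
    h, w = image.shape[:2]
    if h % factor or w % factor:
        raise ValueError("image height and width must be divisible by factor")
    # Split each spatial axis into (blocks, factor), then average over the factor axes.
    blocks = image.reshape(h // factor, factor, w // factor, factor, *image.shape[2:])
    return blocks.mean(axis=(1, 3))

def resolution_pyramid(image: np.ndarray, levels: int) -> list:
    """Return [X_1, ..., X_L], lowest resolution first; X_L is the input image.

    Each level halves the resolution of the next one (an assumed choice).
    """
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(average_pool(pyramid[-1], 2))
    return pyramid[::-1]
```

For example, `resolution_pyramid(img, 3)` on a 32 × 32 × 3 image would return images of spatial size 8 × 8, 16 × 16, and 32 × 32, playing the role of the samples at resolution levels 1, 2, and 3.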

2. PROBLEM SETUP

Throughout this paper, we use $a = \mathrm{poly}(b)$ for $a > 0$, $b > 1$ to denote that there are absolute constants $C_1 > C_2 > 0$ such that $b^{C_2} < a < b^{C_1}$. For a target learning error $\varepsilon \in \big[\frac{1}{d^{\omega(1)}}, \frac{1}{\mathrm{poly}(d)}\big]$, we use "w.h.p." to indicate with probability $\geq 1 - \frac{1}{(d/\varepsilon)^{\omega(1)}}$. Recall $\mathrm{ReLU}(z) = \max\{z, 0\}$. In this paper, for theoretical purposes, we consider a smoothed version of $\mathrm{ReLU}(z)$ and a leaky version $\mathrm{LeakyReLU}(z)$. We give their details in the full version; they differ from $\mathrm{ReLU}(z)$ only by a sufficiently small quantity $1/\mathrm{poly}(d/\varepsilon)$.

2.1. THE TARGET DISTRIBUTION: FORWARD SUPER-RESOLUTION STRUCTURE

We consider outputs (think of them as images) $\{X^\star_\ell\}_{\ell \in [L]}$, where $X^\star_L$ is the final output and $X^\star_\ell$ is a "low resolution" version of $X^\star_L$, with $X^\star_1$ having the lowest resolution. We think of each resolution-$\ell$ image $X^\star_\ell$ as consisting of $d_\ell$ patches (for example, an image of size 36 × 36 contains 36 patches of size 6 × 6), where $X^\star_\ell = (X^\star_{\ell,j})_{j \in [d_\ell]}$ and each $X^\star_{\ell,j} \in \mathbb{R}^d$. Typically, such "resolution reduction" from $X^\star_L$ to $X^\star_\ell$ can be given by sub-sampling, average pooling, Laplacian smoothing, etc., but we do not consider any specific form of resolution reduction in this work, as it does not matter for our main result to hold.
Formally, we define the forward super-resolution property as follows. We are given samples of the form $G^\star(z) = (X^\star_1, X^\star_2, \dots, X^\star_L)$, where each $X^\star_\ell$ is generated by an unknown target neural network $G^\star(z)$ at layer $\ell$, with respect to a standard Gaussian $z \sim \mathcal{N}(0, I_{m_0 \times m_0})$.
• The basic resolution: for every $j \in [d_1]$,
$$X^\star_{1,j} = W^\star_{1,j} S^\star_{1,j} \in \mathbb{R}^d \quad \text{for} \quad S^\star_{1,j} = S^\star_{1,j}(z) = \mathrm{ReLU}(V^\star_{1,j} z - b^\star_{1,j}) \in \mathbb{R}^{m_1}_{\geq 0},$$
where $V^\star_{1,j} \in \mathbb{R}^{m_1 \times m_0}$, $b^\star_{1,j} \in \mathbb{R}^{m_1}$, and we assume $W^\star_{1,j} \in \mathbb{R}^{d \times m_1}$ is column orthonormal.
• For every $\ell > 1$, the image patches at resolution level $\ell$ are given as: for every $j \in [d_\ell]$,
$$X^\star_{\ell,j} = W^\star_{\ell,j} S^\star_{\ell,j} \in \mathbb{R}^d \quad \text{for} \quad S^\star_{\ell,j} = \mathrm{ReLU}\Big(\sum_{j' \in \mathcal{P}_{\ell,j}} V^\star_{\ell,j,j'} S^\star_{\ell-1,j'} - b^\star_{\ell,j}\Big) \in \mathbb{R}^{m_\ell}_{\geq 0},$$
where $V^\star_{\ell,j,j'} \in \mathbb{R}^{m_\ell \times m_{\ell-1}}$, $b^\star_{\ell,j} \in \mathbb{R}^{m_\ell}$, and we assume $W^\star_{\ell,j} \in \mathbb{R}^{d \times m_\ell}$ is column orthonormal. Here, $\mathcal{P}_{\ell,j} \subseteq [d_{\ell-1}]$ can be any subset of $[d_{\ell-1}]$, describing the connection graph.
Remark. For every layer $\ell$, $j \in [d_\ell]$, $r \in [m_\ell]$, one should view each $[S^\star_{\ell,j}]_r$ as the $r$-th channel in the $j$-th patch at layer $\ell$. One should think of $\sum_{j' \in \mathcal{P}_{\ell,j}} V^\star_{\ell,j,j'} S^\star_{\ell-1,j'}$ as the linear "deconvolution" operation over hidden layers. When the network is a deconvolutional network such as in DCGAN (Radford et al., 2015), we have all $W^\star_{\ell,j} = W^\star_\ell$; but we do not restrict ourselves to this case.
As illustrated in Figure 2, we should view $W^\star_{\ell,j}$ as a matrix consisting of the "edge-color" features used to generate image patches. Crucially, when we get a data sample $G^\star(z) = (X^\star_1, X^\star_2, \dots, X^\star_L)$, the learning algorithm does not know the underlying $z$ used for this sample.
Although our analysis holds in many settings, for simplicity, in this paper we focus on the parameter regime of Setting 2.1 (for instance, $d_\ell$ can be $d^\ell$).
To efficiently learn a distribution with the "forward super-resolution" structure, we assume that the true distribution in each layer of $G^\star$ satisfies the following "sparse coding" structure.
Assumption 2.2 (sparse coding structure). For every $\ell \in [L]$, $j \in [d_\ell]$, $p \in [m_\ell]$, there exists some $k_\ell \ll m_\ell$ with $k_\ell \in [\Omega(\log m_\ell), m_\ell^{o(1)}]$ such that, recalling that $S^\star_{\ell,j} \geq 0$ is a non-negative vector (footnote 3):
$$\Pr_{z \sim \mathcal{N}(0,I)}\big[[S^\star_{\ell,j}]_p > 0\big] \leq \frac{\mathrm{poly}(k_\ell)}{m_\ell}, \qquad \mathbb{E}_{z \sim \mathcal{N}(0,I)}\big[[S^\star_{\ell,j}]_p\big] \geq \frac{1}{\mathrm{poly}(k_\ell)\, m_\ell},$$
$$\text{w.h.p. over } z: \quad \|S^\star_{\ell,j}\|_\infty \leq \mathrm{poly}(k_\ell), \qquad \|S^\star_{\ell,j}\|_0 \leq k_\ell.$$
Moreover, within the same patch, the channels are pair-wise and three-wise "not-too-positively correlated": for all $p, q, r \in [m_\ell]$ with $p \neq q \neq r$,
$$\Pr_z\big[[S^\star_{\ell,j}]_p > 0, [S^\star_{\ell,j}]_q > 0\big] \leq \varepsilon_1 = \frac{\mathrm{poly}(k_\ell)}{m_\ell^2}, \qquad \Pr_z\big[[S^\star_{\ell,j}]_p > 0, [S^\star_{\ell,j}]_q > 0, [S^\star_{\ell,j}]_r > 0\big] \leq \varepsilon_2 = \frac{1}{m_\ell^{2.01}}.$$
Remark 2.3. Although we have borrowed the notion of sparse coding, our task is very different from traditional sparse coding. We discuss more in the full version.
Sparse coding structure in practice. The sparse coding structure is very natural in practice for generating images (Gu et al., 2015; Zheng et al., 2010). As illustrated in Figure 2, typically, after training, the output layer $W_{\ell,j}$ of the generator network forms edge-color features. It is known that such edge-color features are indeed a (nearly orthogonal) basis for images, under which the coefficients are indeed very sparse. We refer to (Allen-Zhu & Li, 2021) for concrete measurements of the sparsity and orthogonality.
The "not-too-positive correlation" property is also very natural: for instance, in an image patch, if an edge feature is used, it is less likely that a color feature will be used (see Figure 2). In Figure 3, we demonstrate that for some learned generator networks, the activations indeed become sparse and "not-too-positively correlated" after training.
Crucially, we have only assumed that channels are not-too-positively correlated within a single patch; channels across different patches (e.g., $S^\star_{\ell,1}$ and $S^\star_{\ell,2}$) can be arbitrarily dependent. This makes sure the global structure of the images can still be quite arbitrary, so Assumption 2.2 can indeed be reasonable.
[Figure 3 panels: histogram of $\Pr[[S^\star_{2,j}]_p > 0]$; histogram of $\Pr[[S^\star_{2,j}]_p > 0, [S^\star_{2,j}]_q > 0]$; histogram of $\Pr[[S^\star_{2,j}]_p > 0, [S^\star_{2,j}]_q > 0, [S^\star_{2,j}]_r > 0]$. See the caption of Figure 3 and footnote 4.]
Missing details. We also make very mild non-degeneracy and anti-concentration assumptions, and give examples of networks satisfying our assumptions. We defer them to the full version.
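To make the hierarchical definition of Section 2.1 concrete, here is a minimal two-level sketch of a target generator with the forward super-resolution structure, written with NumPy. All sizes, the random weights, the bias value 1.0, and the connection graph `parents` are hypothetical choices for illustration only; they are not taken from the paper and are not tuned to satisfy Assumption 2.2.

```python
import numpy as np

rng = np.random.default_rng(0)

def column_orthonormal(rows, cols, rng):
    """Random rows x cols matrix with orthonormal columns (requires rows >= cols)."""
    q, _ = np.linalg.qr(rng.standard_normal((rows, cols)))
    return q

def relu(x):
    return np.maximum(x, 0.0)

# Hypothetical toy sizes: latent dim m0, channels m1 and m2, patch dim d.
m0, m1, m2, d = 16, 8, 8, 12
d1, d2 = 2, 4                                  # number of patches at levels 1 and 2
# Assumed connection graph P_{2,j}: each level-2 patch reads one level-1 patch.
parents = {j: [j // 2] for j in range(d2)}

V1 = rng.standard_normal((d1, m1, m0)) / np.sqrt(m0)
b1 = np.full((d1, m1), 1.0)                    # a positive bias encourages sparse activations
W1 = np.stack([column_orthonormal(d, m1, rng) for _ in range(d1)])
V2 = rng.standard_normal((d2, m2, m1))         # one parent per patch, so one matrix per j
b2 = np.full((d2, m2), 1.0)
W2 = np.stack([column_orthonormal(d, m2, rng) for _ in range(d2)])

def target_generator(z):
    """Return images (X_1, X_2) and hidden activations (S_1, S_2) for one latent z."""
    S1 = np.stack([relu(V1[j] @ z - b1[j]) for j in range(d1)])
    X1 = np.stack([W1[j] @ S1[j] for j in range(d1)])
    S2 = np.stack([
        relu(sum(V2[j] @ S1[jp] for jp in parents[j]) - b2[j])
        for j in range(d2)
    ])
    X2 = np.stack([W2[j] @ S2[j] for j in range(d2)])
    return (X1, X2), (S1, S2)

(X1, X2), (S1, S2) = target_generator(rng.standard_normal(m0))
```

Because each output matrix has orthonormal columns, the hidden activations can be read back from a patch by a single matrix product, for example `W2[j].T @ X2[j]` equals `S2[j]` up to floating-point error. The learner only ever sees `(X1, X2)`, never `z` or the activations.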

2.2. LEARNER NETWORK (GENERATOR)

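We use a learner network (generator) that has the same structure as the (unknown) target network:
• The image of the first resolution is given by
$$X_{1,j} = W_{1,j} S_{1,j} \in \mathbb{R}^d \quad \text{for} \quad S_{1,j} = \mathrm{LeakyReLU}(V_{1,j} z - b_{1,j}) \in \mathbb{R}^{m_1},$$
for $W_{1,j} \in \mathbb{R}^{d \times m_1}$ and $V_{1,j} \in \mathbb{R}^{m_1 \times m'_0}$ with $m'_0 \geq 2 d_1 m_1$.
• The image of higher resolution is given by
$$X_{\ell,j} = W_{\ell,j} S_{\ell,j} \in \mathbb{R}^d \quad \text{for} \quad S_{\ell,j} = \mathrm{LeakyReLU}\Big(\sum_{j' \in \mathcal{P}_{\ell,j}} V_{\ell,j,j'} S_{\ell-1,j'} - b_{\ell,j}\Big) \in \mathbb{R}^{m_\ell},$$
for $W_{\ell,j} \in \mathbb{R}^{d \times m_\ell}$ and $V_{\ell,j,j'} \in \mathbb{R}^{m_\ell \times m_{\ell-1}}$.
One can view $S_\ell$ as the $\ell$-th hidden layer. We use $G_\ell(z)$ to denote $(X_{\ell,j})_{j \in [d_\ell]}$. We point out that both the target and the learner networks we study here can be standard deconvolution networks.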

2.3. THEOREM STATEMENT

This paper proves that by applying SGDA on a generator-discriminator objective (the algorithm is described in Section 3), we can learn the target distribution using the above generator network.
Theorem E.1. For every $d > 0$, every $\varepsilon \in \big[\frac{1}{d^{\omega(1)}}, \frac{1}{2}\big]$, letting $G(z) = (X_1(z), \dots, X_L(z))$ be the generator learned after running Algorithm 6 (which runs in time/sample complexity $\mathrm{poly}(d/\varepsilon)$), then w.h.p. there is a column orthonormal matrix $U \in \mathbb{R}^{m_0 \times m'_0}$ such that
$$\Pr_{z \sim \mathcal{N}(0, I_{m'_0 \times m'_0})}\big[\|G^\star(Uz) - G(z)\|_2 \leq \varepsilon\big] \geq 1 - \frac{1}{(d/\varepsilon)^{\omega(1)}}.$$
In particular, this implies the 2-Wasserstein distance $W_2(G(\cdot), G^\star(\cdot)) \leq \varepsilon$.

3. LEARNING ALGORITHM

In this section, we define the learning algorithm using min-max optimization. We assume one can access polynomially many (i.e., $\mathrm{poly}(d/\varepsilon)$) i.i.d. samples from the true distribution $X^\star = (X^\star_1, X^\star_2, \dots, X^\star_L)$, generated by the (unknown) target network defined in Section 2.1.
To begin with, we use a simple SVD warm start to initialize (only) the output layers $W_{\ell,j}$ of the network. It merely involves a simple estimator of a certain truncated covariance of the data. We defer it to the full paper. Also, we use stochastic gradient descent ascent (SGDA) on the GAN objective to refer to an algorithm that optimizes $\min_x \max_y f(x, y)$, where the inner maximization is trained at a faster frequency. We call it Algorithm 4 and include its pseudocode in the full paper.
(Footnote to Assumption 2.2: we also point out that if the $[S^\star_{\ell,j}]_p$'s are all independent, then $\Pr[[S^\star_{\ell,j}]_p > 0, [S^\star_{\ell,j}]_q > 0] \approx \frac{1}{m_\ell^2} \leq \varepsilon_1$ and $\Pr[[S^\star_{\ell,j}]_p > 0, [S^\star_{\ell,j}]_q > 0, [S^\star_{\ell,j}]_r > 0] \approx \frac{1}{m_\ell^3} \ll \varepsilon_2$.)
To make the learning process clearer, we break the learning into multiple parts and introduce them separately in this section:
• GAN OutputLayer: to learn the output matrices $\{W_{\ell,j}\}$ per layer.
• GAN FirstHidden: to learn the hidden matrices $\{V_{1,j}\}$ for the first layer.
• GAN FowardSuperResolution: to learn the higher-level hidden layers $\{V_{\ell,j,j'}\}$.
We use different discriminators in different parts for our theoretical analysis, and shall characterize what each discriminator does and how the generator can leverage the discriminator to learn the target distribution. We point out that, although one can add up and mix those discriminators to make a single one, how to use the same discriminator across the entire algorithm remains open. At the end of this section, we shall explain how the parts are combined to give the final training process.
Remark 3.1. Although we apply an SVD algorithm to get a warm start on the output matrices $W_{\ell,j}$, the majority of the learning of $W_{\ell,j}$ (e.g., to any small error $\varepsilon = \frac{1}{\mathrm{poly}(d)}$) is still done through gradient descent ascent. We point out that the seminal work on neurally plausible dictionary learning also considers such a warm start (Arora et al., 2015a).
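The phrase "the inner maximization is trained at a faster frequency" can be illustrated with a small sketch. The code below is not the paper's Algorithm 4 (whose pseudocode is in the full paper); it is a deterministic stand-in that uses exact gradients on an assumed toy objective $f(x, y) = (x - 1)y - \tfrac{1}{2}y^2$, and the name `sgda` and the parameter `inner_steps` are hypothetical.

```python
def sgda(grad_x, grad_y, x0, y0, lr, steps, inner_steps=5):
    """Gradient descent ascent for min_x max_y f(x, y).

    The ascent variable y takes `inner_steps` updates per single descent
    update of x, i.e. the inner maximization runs at a faster frequency.
    """
    x, y = x0, y0
    for _ in range(steps):
        for _ in range(inner_steps):
            y = y + lr * grad_y(x, y)   # ascent on y
        x = x - lr * grad_x(x, y)       # descent on x
    return x, y

# Toy objective (an assumption for illustration): f(x, y) = (x - 1) * y - 0.5 * y**2.
# For fixed x the best response is y = x - 1, so max_y f = 0.5 * (x - 1)**2,
# which is minimized at x = 1 (where y = 0).
grad_x = lambda x, y: y
grad_y = lambda x, y: (x - 1.0) - y

x_final, y_final = sgda(grad_x, grad_y, x0=3.0, y0=0.0, lr=0.1, steps=500)
```

On this toy problem the iterates approach the saddle point $(x, y) = (1, 0)$. In the stochastic setting, the exact gradients would be replaced by minibatch estimates computed from samples.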

3.1. LEARN THE OUTPUT LAYER

We first introduce the discriminator for learning the output layer. For each resolution $\ell \in [L]$ and patch $j \in [d_\ell]$, we consider a one-hidden-layer discriminator
$$D^{(1)}_{\ell,j}(Y) := \sum_{r \in [m_\ell]} \mathrm{ReLU}'\big([(W^D_{\ell,j})^\top Y_j]_r - b\big)\, \langle Y_j, V^D_{\ell,j,r} \rangle,$$
where the input is either $Y = X^\star_\ell$ (from the true distribution) or $Y = X_\ell$ (from the generator). Above, on the discriminator side, we have default parameters $W^D_{\ell,j}$, $b$ and trainable parameters $V^D_{\ell,j} = (V^D_{\ell,j,r})_{r \in [m_\ell]}$, where each $V^D_{\ell,j,r} \in \mathbb{R}^d$. On the generator side, we have trainable parameters $W_{\ell,j}$ (which are used to calculate $X_\ell$). (We use the superscript $D$ to emphasize that $W^D_{\ell,j}$ are parameters of the discriminator, to distinguish them from $W_{\ell,j}$.)
In our pseudocode GAN OutputLayer (see Algorithm 1), for fixed $W^D_{\ell,j}$ and $b$, we perform gradient descent ascent on the GAN objective with discriminator $D^{(1)}_{\ell,j}$, minimizing over the generator parameters $W_{\ell,j}$ and maximizing over the discriminator parameters $V^D_{\ell,j}$. In our final training process (given in full in Algorithm 6), we shall start with some $b \ll 1$ and periodically decrease it; and we shall periodically set $W^D_{\ell,j} = W_{\ell,j}$ to be the same as the generator from a previous checkpoint.
• Simply setting $W^D_{\ell,j} = W_{\ell,j}$ involves no additional learning, as all the learning is still being done using gradient descent ascent.
• In practice, the first hidden layer of the discriminator indeed learns edge-color detectors (see Figure 8 in the full paper), similar to the edge-color features in the output layer of the generator. Thus, setting $W^D_{\ell,j} = W_{\ell,j}$ is a reasonable approximation. As we pointed out, how to analyze a discriminator that exactly matches practice is an important open theory direction.
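The one-hidden-layer discriminator above can be sketched for a single patch as follows. This is an illustrative NumPy rendering under two assumptions: $\mathrm{ReLU}'$ is taken to be the exact 0/1 step function (the paper works with a smoothed variant), and the trainable vectors $V^D_{\ell,j,r}$ are stored as the columns of a matrix `V_D`. The function name `discriminator_d1` is hypothetical.

```python
import numpy as np

def discriminator_d1(Y_j, W_D, V_D, b):
    """D^(1)(Y) = sum_r 1[(W_D^T Y_j)_r > b] * <Y_j, V_D[:, r]> for one patch.

    Y_j : (d,)   one image patch
    W_D : (d, m) default (non-trainable) discriminator features
    V_D : (d, m) trainable discriminator vectors, one column per channel r
    b   : float  threshold inside ReLU'
    """
    gate = (W_D.T @ Y_j > b).astype(float)   # ReLU'(. - b): a 0/1 gate per channel
    return float(gate @ (V_D.T @ Y_j))       # gated sum of inner products
```

The gate selects the channels that `W_D` considers active in the patch, and only those channels contribute a linear score through `V_D`.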

INTUITION: WHAT DOES THE DISCRIMINATOR DO?

To further understand the algorithm, we can see that for each $V^D_{\ell,j,r}$, when its norm is fixed, the maximizer is obtained at
$$V^D_{\ell,j,r} \propto \mathbb{E}\big[\mathrm{ReLU}'([(W^D_{\ell,j})^\top X^\star_{\ell,j}]_r - b)\, X^\star_{\ell,j}\big] - \mathbb{E}\big[\mathrm{ReLU}'([(W^D_{\ell,j})^\top X_{\ell,j}]_r - b)\, X_{\ell,j}\big].$$
Thus, for the generator to further minimize the objective, the generator will learn to match the moments of the true distribution. In other words, the generator wants to ensure
$$\mathbb{E}\big[\mathrm{ReLU}'([(W^D_{\ell,j})^\top X_{\ell,j}]_r - b)\, X_{\ell,j}\big] \approx \mathbb{E}\big[\mathrm{ReLU}'([(W^D_{\ell,j})^\top X^\star_{\ell,j}]_r - b)\, X^\star_{\ell,j}\big].$$
In this paper, we prove that such a truncated moment can be matched efficiently simply by running gradient descent ascent. Moreover, we empirically observe (see Figure 4) that GANs can indeed do moment matching within each patch even at the earlier stage of training.
Algorithm 1 (GAN OutputLayer): method of moments
Input: $W^{(0)}_{\ell,j}$, $b$, $\ell$, $j$
1: Set $W^D_{\ell,j} \leftarrow W^{(0)}_{\ell,j}$; $b \leftarrow b\, m^{0.152}$; $N \leftarrow \frac{1}{\mathrm{poly}(d/\varepsilon)}$, $\eta \leftarrow \frac{1}{\mathrm{poly}(d/\varepsilon)}$, $T \leftarrow \frac{\mathrm{poly}(d/\varepsilon)}{\eta}$
2: Set initialization $W_{\ell,j} \leftarrow W^{(0)}_{\ell,j}$ and $V^D_{\ell,j} \leftarrow 0$.
3: Apply SGDA (Algorithm 4) with $N$ samples and learning rate $\eta$ for $T$ steps on the following GAN objective (with $c$ being a small constant such as 0.001):
$$\min_{W_{\ell,j}} \max_{V^D_{\ell,j}} \; \mathbb{E}[D^{(1)}_{\ell,j}(X^\star_\ell)] - \mathbb{E}[D^{(1)}_{\ell,j}(X_\ell)] - \sum_{r \in [m_\ell]} \|V^D_{\ell,j,r}\|_2^{1+c}$$
⋄ $\|V^D_{\ell,j,r}\|_2^{1+c}$ is an analog of the weight ...
4: $[W_{\ell,j}]_p \leftarrow [W_{\ell,j}]_p / \|[W_{\ell,j}]_p\|_2$
[Figure 4 panels: 1st- through 6th-order moment matching, with axis ticks at epoch 1, epoch 3, epoch 10, and epoch 20; see the caption of Figure 4.]
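The truncated moments that the discriminator compares can be estimated from samples. The following NumPy sketch computes the empirical version of $\mathbb{E}[\mathrm{ReLU}'([(W^D)^\top X]_r - b)\, X]$ for every channel $r$ and returns the difference between real and generated patches, which is the direction of the discriminator's best response when its norm is fixed. As before, $\mathrm{ReLU}'$ is taken to be the exact step function, and the function names are hypothetical.

```python
import numpy as np

def truncated_moments(X, W_D, b):
    """Row r is the empirical E[ 1[(W_D^T x)_r > b] * x ] over the rows x of X.

    X   : (n, d) patches, one sample per row
    W_D : (d, m) default discriminator features
    """
    gates = (X @ W_D > b).astype(float)      # (n, m): which channels fire per sample
    return gates.T @ X / X.shape[0]          # (m, d): gated first moments

def best_response_direction(X_real, X_fake, W_D, b):
    """Difference of truncated moments; V^D_r is proportional to row r."""
    return truncated_moments(X_real, W_D, b) - truncated_moments(X_fake, W_D, b)
```

When the generated patches already match the real ones in these truncated moments, the returned matrix is (close to) zero, so the discriminator has no direction in which to gain, which is the moment-matching condition described above.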

3.2. LEARN THE FIRST HIDDEN LAYER

Due to space limitations, we defer the pseudocode and algorithm details of GAN FirstHidden to the full version of this paper. However, we give the high-level intuitions below.
HIGH-LEVEL INTUITIONS. In the process of learning the lowest-resolution images $X^\star_1$, one cannot hope to learn (even approximately) the exact matrices $V^\star_{1,j}$, or the exact function that maps $z \mapsto X^\star_1$ (because $z$ is unknown during training). Instead, the task is to learn the distribution of $X^\star_{1,j} = W^\star_{1,j} \mathrm{ReLU}(V^\star_{1,j} z - b^\star_{1,j})$. Suppose for a moment that the $W^\star_{1,j}$ are already fully learned; then, it is perhaps not surprising that for the remaining part $S^\star_{1,j} = \mathrm{ReLU}(V^\star_{1,j} z - b^\star_{1,j})$, if we can somehow
1. learn the marginal distribution of $[S^\star_{1,j}]_r$ for each $j, r$, and
2. learn the joint distribution of $[S^\star_{1,j}]_r, [S^\star_{1,j'}]_{r'}$ for each pair $(j, r) \neq (j', r')$,
then we can recover the joint distribution of $\{[S^\star_{1,j}]_r\}_{j,r}$. (As an analogy, for a joint Gaussian, it suffices to learn the pair-wise correlations.) To achieve this, we design discriminators $D^{(4)}$ and $D^{(5)}$.
• $D^{(4)}$ discriminates the mismatch from single neurons by ensuring (footnote 5)
$$\mathbb{E}\,\mathrm{ReLU}\big([(W^D_{1,j})^\top X_{1,j}]_r - b\big) \approx \mathbb{E}\,\mathrm{ReLU}\big([(W^D_{1,j})^\top X^\star_{1,j}]_r - b\big),$$
$$\mathbb{E}\,\mathrm{ReLU}'\big([(W^D_{1,j})^\top X_{1,j}]_r - b\big) \approx \mathbb{E}\,\mathrm{ReLU}'\big([(W^D_{1,j})^\top X^\star_{1,j}]_r - b\big).$$
Furthermore, as long as $W^D_{1,j}$ is moderately learned, the sparse coding structure ensures $(W^D_{1,j})^\top X_{1,j} \approx S_{1,j}$ and $(W^D_{1,j})^\top X^\star_{1,j} \approx S^\star_{1,j}$. For this reason, and using $b \ll 1$, applying gradient descent ascent using discriminator $D^{(4)}$ in fact guarantees
$$\mathbb{E}\,\mathrm{ReLU}([S_{1,j}]_r) \approx \mathbb{E}\,\mathrm{ReLU}([S^\star_{1,j}]_r) \quad \text{and} \quad \mathbb{E}\,\mathrm{ReLU}'([S_{1,j}]_r) \approx \mathbb{E}\,\mathrm{ReLU}'([S^\star_{1,j}]_r).$$
Recall that $[S^\star_{1,j}]_r$ behaves as $\mathrm{ReLU}(g)$ for $g \sim \mathcal{N}(-\mu, \sigma^2)$ and has only 2 degrees of freedom; thus, matching moments on $\mathrm{ReLU}$ and $\mathrm{ReLU}'$ can learn the distribution of a single neuron $[S^\star_{1,j}]_r$.
• $D^{(5)}$ discriminates the mismatch from the moments across two neurons, by ensuring
$$\mathbb{E}\Big[\mathrm{ReLU}\big([(W^D_{1,j})^\top X_{1,j}]_r - b\big)\, \mathrm{ReLU}\big([(W^D_{1,j'})^\top X_{1,j'}]_{r'} - b\big)\Big] \approx \mathbb{E}\Big[\mathrm{ReLU}\big([(W^D_{1,j})^\top X^\star_{1,j}]_r - b\big)\, \mathrm{ReLU}\big([(W^D_{1,j'})^\top X^\star_{1,j'}]_{r'} - b\big)\Big].$$
For a similar reason, gradient descent ascent learns to match moments on the cross terms:
$$\mathbb{E}\big[\mathrm{ReLU}([S_{1,j}]_r)\, \mathrm{ReLU}([S_{1,j'}]_{r'})\big] \approx \mathbb{E}\big[\mathrm{ReLU}([S^\star_{1,j}]_r)\, \mathrm{ReLU}([S^\star_{1,j'}]_{r'})\big].$$
We show this corresponds to learning $\langle [V^\star_{1,j}]_r, [V^\star_{1,j'}]_{r'} \rangle$ to a moderate accuracy.
In sum, if we apply SGDA on $D^{(4)}$ and $D^{(5)}$ together, we can hope to learn $V_1$ up to a unitary transformation (see Lemma I.18). This ensures that we learn the distribution of $X^\star_1$.
[Figure 5 diagram labels: $z$ (1x100), 4x4x64, 8x8x64, 16x16x64, 32x32x64; forward super-resolution: a local operation.]
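The claim that the $\mathrm{ReLU}$ and $\mathrm{ReLU}'$ moments pin down a single neuron can be checked in closed form. For $g \sim \mathcal{N}(-\mu, \sigma^2)$ and $t = \mu/\sigma$, one has $\mathbb{E}[\mathrm{ReLU}'(g)] = \Pr[g > 0] = \Phi(-t)$ and $\mathbb{E}[\mathrm{ReLU}(g)] = \sigma(\varphi(t) - t\,\Phi(-t))$, where $\Phi$ and $\varphi$ are the standard normal CDF and density. The sketch below, using only the Python standard library, inverts these two equations; the function names are hypothetical, and this is a worked illustration of the two-degrees-of-freedom remark rather than part of the paper's algorithm.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal N(0, 1)

def relu_moments(mu, sigma):
    """Return (E[ReLU'(g)], E[ReLU(g)]) for g ~ N(-mu, sigma^2)."""
    t = mu / sigma
    p = STD.cdf(-t)                           # Pr[g > 0]
    mean_relu = sigma * (STD.pdf(t) - t * p)  # E[max(g, 0)]
    return p, mean_relu

def recover_mu_sigma(p, mean_relu):
    """Invert the two moments back to (mu, sigma)."""
    t = -STD.inv_cdf(p)                       # t = mu / sigma from Pr[g > 0]
    sigma = mean_relu / (STD.pdf(t) - t * p)  # then sigma from E[ReLU(g)]
    return t * sigma, sigma
```

For instance, `recover_mu_sigma(*relu_moments(1.5, 0.7))` returns values numerically close to `(1.5, 0.7)`, so two neurons with the same pair of moments have the same marginal distribution.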

3.3. LEARN HIGHER HIDDEN LAYERS

For resolution $\ell > 1$, patch $j \in [d_\ell]$, and channel $r \in [m_\ell]$, to learn $[V^\star_{\ell,j}]_r$, we introduce the discriminator $D^{(2)}_{\ell,j,r}(Y_1, Y_2)$. It takes as input images of two resolutions: one should think of either $(Y_1, Y_2) = (X^\star_\ell, X^\star_{\ell-1})$ coming from the true distribution, or $(Y_1, Y_2) = (X_\ell, X_{\ell-1})$ coming from the generator.
$$D^{(2)}_{\ell,j,r}(Y_1, Y_2) := \mathrm{abs}\big(s_r - \mathrm{LeakyReLU}(\tilde{s}_r)\big), \quad \text{where } \mathrm{abs}(x) := \mathrm{ReLU}(x - b) + \mathrm{ReLU}(-x - b),$$
$$s_r := \big[[W^D_{\ell,j}]^\top Y_{1,j}\big]_r, \qquad \tilde{s}_r := \Big[\sum_{j' \in \mathcal{P}_{\ell,j}} V^D_{\ell,j,j'}\, \mathrm{LeakyReLU}\big([W^D_{\ell-1,j'}]^\top Y_{2,j'}\big) - b^D_{\ell,j}\Big]_r.$$
Above, again, $W^D_{\ell,j}$, $\{W^D_{\ell-1,j'}\}_{j' \in [d_{\ell-1}]}$, and $b$ are default parameters (changed only periodically). On the discriminator side, $\{[V^D_{\ell,j,j'}]_r\}_{j' \in \mathcal{P}_{\ell,j}}$ and $[b^D_{\ell,j}]_r$ are the actual trainable parameters; on the generator side, $\{[V_{\ell,j,j'}]_r\}_{j' \in \mathcal{P}_{\ell,j}}$ and $[b_{\ell,j}]_r$ are the trainable parameters. We note that this discriminator $D^{(2)}$ is a three-hidden-layer neural network. Yet, we show that such a network (together with the generator) can still be trained efficiently using gradient descent ascent.
Algorithm 2 (GAN FowardSuperResolution): using super-resolution to learn higher hidden layers
Input: $W^{(0)}_\ell$, $W^{(0)}_{\ell-1}$, $b$, $\ell$, $j$
1: Set default parameters $W^D_{\ell,j} \leftarrow W^{(0)}_{\ell,j}$, $W^D_{\ell-1,j'} \leftarrow W^{(0)}_{\ell-1,j'}$;
2: $N \leftarrow \frac{1}{\mathrm{poly}(d/\varepsilon)}$, $\eta \leftarrow \frac{1}{\mathrm{poly}(d/\varepsilon)}$, $T \leftarrow \frac{\mathrm{poly}(d/\varepsilon)}{\eta}$; $\lambda_G, \lambda_D \leftarrow \frac{1}{\mathrm{poly}(d/\varepsilon)}$
3: Initialize $V_{\ell,j,j'} = V^D_{\ell,j,j'} = I$ for one of $j' \in \mathcal{P}_{\ell,j}$, setting the others to zero. Initialize $b_{\ell,j} = 0$.
4: for $r \in [m_\ell]$ do
5: Apply SGDA with $N$ samples and learning rate $\eta$ for $T$ steps on the following GAN objective:
$$\min_{\{[V^D_{\ell,j,j'}]_r\}_{j' \in \mathcal{P}_{\ell,j}},\, [b^D_{\ell,j}]_r} \;\; \max_{\{[V_{\ell,j,j'}]_r\}_{j' \in \mathcal{P}_{\ell,j}},\, [b_{\ell,j}]_r} \; \mathbb{E}[D^{(2)}_{\ell,j,r}(X^\star_\ell, X^\star_{\ell-1})] - \mathbb{E}[D^{(2)}_{\ell,j,r}(X_\ell, X_{\ell-1})] - \lambda_G \|V_\ell\|_F^2 + \lambda_D \|V^D_\ell\|_F^2$$
6: $[b_{1,j}]_r \leftarrow [b_{1,j}]_r + \mathrm{poly}(k_1)\, b$.
INTUITION: WHAT DOES THE DISCRIMINATOR DO? In this case, applying gradient descent ascent on $D^{(2)}$ actually learns how to "super-resolute" the image from resolution level $\ell - 1$ to level $\ell$. In particular, the discriminator wants to find a way in which the patches $(X_{\ell,j}, X_{\ell-1,j'})$ differ statistically from the patches $(X^\star_{\ell,j}, X^\star_{\ell-1,j'})$. For example, it can discriminate when $X^\star_{\ell-1,j'} = v_1 \implies X^\star_{\ell,j} = v_2$, but $X_{\ell-1,j'} = v_1$, $X_{\ell,j} \neq v_2$. In essence, it detects when the generator super-resolutes a patch $X^\star_{\ell,j}$ from the lower resolution differently from the true distribution.
As we demonstrate in Figure 5, such a "super-resolution" operation is local, meaning that the learning process can be separated into learning over individual patches. The global structure across different patches of the images is learned in lower resolutions. This makes the learning process much simpler compared with learning the full image from scratch (footnote 6). We also provide empirical justification of the power of this "forward super-resolution" in Figure 10 (top) of the full paper: higher layers can indeed learn to super-resolute from the lower-resolution images, which makes the learning much easier compared with learning from scratch.
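The following NumPy sketch evaluates a discriminator of the form of $D^{(2)}$ for one patch and one channel. It is an illustration under several assumptions: the LeakyReLU slope 0.01 is an assumed value (the paper defers the exact activation to the full version), the bias is applied after the sum over parent patches, and the data layout (dictionaries keyed by parent index) and the function names are hypothetical.

```python
import numpy as np

def leaky_relu(x, slope=0.01):
    """LeakyReLU with an assumed slope; the paper's exact variant is in its full version."""
    return np.where(x > 0, x, slope * x)

def abs_b(x, b):
    """Thresholded absolute value: ReLU(x - b) + ReLU(-x - b)."""
    return np.maximum(x - b, 0.0) + np.maximum(-x - b, 0.0)

def discriminator_d2(Y1_j, Y2, r, W_D_cur, W_D_prev, V_D, b_D, parents, b):
    """Compare the decoded channel r of a level-l patch with its prediction
    from the level-(l-1) patches listed in `parents`.

    Y1_j     : (d,) patch at level l
    Y2       : dict j' -> (d,) patches at level l-1
    W_D_cur  : (d, m_l) default features at level l
    W_D_prev : dict j' -> (d, m_{l-1}) default features at level l-1
    V_D      : dict j' -> (m_l, m_{l-1}) trainable discriminator matrices
    b_D      : (m_l,) trainable discriminator bias
    """
    s = (W_D_cur.T @ Y1_j)[r]                      # decoded activation at level l
    pre = sum(V_D[jp] @ leaky_relu(W_D_prev[jp].T @ Y2[jp]) for jp in parents) - b_D
    return float(abs_b(s - leaky_relu(pre[r]), b))  # mismatch beyond the threshold b
```

The score is zero whenever the level-$\ell$ activation agrees, up to the threshold $b$, with what the discriminator's local map predicts from the lower-resolution patches, and grows linearly with the mismatch otherwise. Only the patches in `parents` are read, which reflects the locality of the operation.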

3.4. FINAL ALGORITHM

We implement our full algorithm in Algorithm 6 (see the full paper). It performs layer-wise training. In each outer loop $\ell = 1, 2, \dots, L$, it first warm-starts the output layer $\{W_{\ell,j}\}_{j \in [d_\ell]}$; note that those weights are still very inaccurate (footnote 7). Next, for this layer $\ell$, Algorithm 6 alternately:
• uses the current output layer $W_{\ell,j}$ to learn the hidden variables $S_{\ell,j}$ (or equivalently the weights $V_{\ell,j}$, $b_{\ell,j}$) to some accuracy, by applying GAN FirstHidden if $\ell = 1$ or GAN FowardSuperResolution if $\ell \geq 2$; and
• uses the current hidden variables $S_{\ell,j}$ to learn the output layer $W_{\ell,j}$ to an even better accuracy, by applying GAN OutputLayer.
This alternating process repeats for $T' = O(1)$ stages.
Once again, we have broken the learning into multiple parts for analysis purposes, so that it becomes clear how the generator can leverage the discriminator at different stages to learn the target distribution. (With more careful choices of learning rates, one can also combine them altogether.) Please note that, besides a simple SVD warm start that is called only once per output layer $W_{\ell,j}$, all the learning is done using min-max optimization on a generator-discriminator objective.
What's in the Full Paper. We encourage readers to see our full paper at https://arxiv.org/abs/2106.02619. In the full version, we include more related works and missing figures to better support the connection between our theory and practice. We also include missing details for our technical assumptions from Section 2 and pseudocode from Section 3. We restate our main theorem and the high-level proof plan, and also discuss limitations and open directions there.
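The control flow described above can be summarized in a short skeleton. This is only a sketch of the schedule, not Algorithm 6 itself (which is given in the full paper): the four callbacks are hypothetical placeholders standing in for the SVD warm start, GAN FirstHidden, GAN FowardSuperResolution, and GAN OutputLayer, and periodic updates such as decreasing $b$ are omitted.

```python
def train_layerwise(num_layers, num_stages, warm_start, learn_first_hidden,
                    learn_higher_hidden, learn_output_layer):
    """Layer-wise schedule: warm start once per layer, then alternate between
    learning the hidden weights and refining the output layer."""
    for layer in range(1, num_layers + 1):
        warm_start(layer)                       # rough SVD-based estimate of W_l
        for _ in range(num_stages):             # T' = O(1) alternating stages
            if layer == 1:
                learn_first_hidden(layer)       # stands in for GAN FirstHidden
            else:
                learn_higher_hidden(layer)      # stands in for GAN FowardSuperResolution
            learn_output_layer(layer)           # stands in for GAN OutputLayer

# Usage: record the order in which the (placeholder) sub-routines are called.
log = []
train_layerwise(
    num_layers=2, num_stages=2,
    warm_start=lambda l: log.append(("svd", l)),
    learn_first_hidden=lambda l: log.append(("first_hidden", l)),
    learn_higher_hidden=lambda l: log.append(("super_resolution", l)),
    learn_output_layer=lambda l: log.append(("output", l)),
)
```

Passing the sub-routines as callbacks keeps the schedule separate from the individual min-max problems, mirroring how the analysis treats each part on its own.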



Footnote 1. The full version of this paper can be found at https://arxiv.org/abs/2106.02619.
Footnote 2. Plus a simple SVD warmup initialization that is easily computable from the covariance of image patches.
Footnote 3. Here, $\mathrm{poly}(k_\ell)$ can be an arbitrary polynomial such as $(k_\ell)^{100}$, and our final theorem holds for sufficiently large $d$ because $d^{o(1)} > \mathrm{poly}(k_\ell)$.
Footnote 4. Within a patch, it is natural that the activations are not-too-positively correlated: for example, once a patch chooses to use a horizontal edge feature, it is less likely that it will pick up another vertical edge feature.
Footnote 5. As in the previous subsection, we shall periodically set $W^D_{\ell,j} = W_{\ell,j}$ to be the same as the generator from a previous checkpoint, and use a bias $b \ll 1$.
Footnote 6. At resolution 1 the learning is global; in this case the one-hidden-layer generator can be trained via SGDA to capture the "global structure" of images (see Section 3.2 and Figure 1), with the help of properties of Gaussian random variables.
Footnote 7. Since the hidden variables $S_{\ell,j}$ at this layer $\ell$, which depend on the weights $\{V_{\ell,j}\}_{j \in [d_\ell]}$, are still not learned at this point, the best one can do is to look at the data covariance and give $W_{\ell,j}$ a very rough estimate.



Figure 1: Illustration of the forward super-resolution structure. Church images generated by a 4-hidden-layer deconvolution network (DCGAN), trained on the LSUN Church data set using multi-scaled gradient (Karnewar & Wang, 2019). The structure of the generator is shown above, and there is a ReLU activation between consecutive layers. We use simple average pooling to construct low-resolution images from the original training images.

[Figure 2 panel labels: examples of patches dominated by edge features; examples of patches dominated by color features.]

Figure 2: Visualization of the edge-color features learned in the output layers of G ⋆ . Each W ℓ,j is of dimension m ℓ × d = 64 × 108 = 64 × (6 × 6 × 3). The network is trained as in Figure 1. Note: For a deconvolutional output layer, all W ℓ,j 's are equal for all j ∈ [m ℓ ].

Setting 2.1. L = O(1), each m ℓ = poly(d), each d ℓ = poly(d), and each ∥V ⋆ ℓ,j,j ′ ∥ F ≤ poly(d).

Figure 3: Histograms at random initialization vs. after training for layer ℓ = 2 of the architecture in Figure 1. Experiments for other layers can be found in Figure 6. The learned network has sparse, not-too-positively correlated hidden activations (we did not regularize sparsity or correlation during training). Thus, it can be reasonable to assume that the activations of the target network are also sparse.

Figure 4: Difference between the moments of a generator's output and the true distribution. The x-axis is the number of epochs and the y-axis quantifies how close the moments are (the smaller the closer). Details are in Figure 9. One can see that the moments begin to match after epoch 10.

Figure 5: Forward super-resolution is a local operation; more details in Figure 7.

