SINGLE-LEVEL ADVERSARIAL DATA SYNTHESIS BASED ON NEURAL TANGENT KERNELS

Anonymous

Abstract

Generative adversarial networks (GANs) have achieved impressive performance in data synthesis and have driven the development of many applications. However, GANs are known to be hard to train due to their bilevel objective, which leads to problems with convergence, mode collapse, and vanishing gradients. In this paper, we propose a new generative model, called the generative adversarial NTK (GA-NTK), that has a single-level objective. GA-NTK keeps the spirit of adversarial learning (which helps generate plausible data) while avoiding the training difficulties of GANs. This is done by modeling the discriminator as a Gaussian process with a neural tangent kernel (NTK-GP) whose training dynamics can be completely described by a closed-form formula. We analyze the convergence behavior of GA-NTK trained by gradient descent and give sufficient conditions for convergence. We also conduct extensive experiments to study the advantages and limitations of GA-NTK and propose techniques that make GA-NTK more practical.[1]

1. INTRODUCTION

Generative adversarial networks (GANs) (Goodfellow et al., 2014; Radford et al., 2016), a branch of deep generative models based on adversarial learning, have received much attention due to their novel problem formulation and impressive performance in data synthesis. Variants of GANs have also driven recent developments in many applications, such as super-resolution (Ledig et al., 2017), image inpainting (Xu et al., 2014), and video generation (Vondrick et al., 2016). A GAN framework consists of a discriminator network D and a generator network G, parametrized by θ_D and θ_G, respectively. Given a d-dimensional data distribution P_data and a c-dimensional noise distribution P_noise, the generator G maps a random noise z ∈ R^c to a point G(z) ∈ R^d in the data space, while the discriminator D takes a point x ∈ R^d as input and tells whether x is real or fake, i.e., D(x) = 1 if x ∼ P_data and D(x) = 0 if x ∼ P_gen, where P_gen is the distribution of G(z) with z ∼ P_noise. The objective of GANs is typically formulated as a bilevel optimization problem:

arg min_{θ_G} max_{θ_D} E_{x∼P_data}[log D(x)] + E_{z∼P_noise}[log(1 − D(G(z)))].   (1)

The discriminator D and generator G aim to break each other through the inner max and outer min objectives, respectively. The studies by Goodfellow et al. (2014); Radford et al. (2016) show that this adversarial formulation can lead to a better generator that produces plausible data points/images. However, GANs are known to be hard to train due to the following issues (Goodfellow, 2016).

Failure to converge. In practice, Eq. (1) is usually only approximately solved by an alternating first-order method such as alternating stochastic gradient descent (SGD). The alternating updates for θ_D and θ_G may cancel each other's progress. During each alternating training step, it is also tricky to balance the number of SGD updates for θ_D against that for θ_G, as too small or too large a number for θ_D leads to low-quality gradients for θ_G.
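To make the alternating scheme concrete, the following is a minimal 1-D toy sketch of alternating SGD on the GAN objective of Eq. (1). The linear generator G(z) = a·z + b, the logistic discriminator D(x) = sigmoid(w·x + c), and all constants are illustrative choices for this sketch only, not the setup used in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda u: 1.0 / (1.0 + np.exp(-u))

# 1-D toy: real data ~ N(2, 0.5); G(z) = a*z + b; D(x) = sigmoid(w*x + c)
a, b = 1.0, 0.0          # theta_G
w, c = 0.0, 0.0          # theta_D
lr, batch = 0.05, 64

for step in range(2000):
    x = rng.normal(2.0, 0.5, batch)   # real minibatch
    z = rng.normal(0.0, 1.0, batch)   # noise minibatch
    g = a * z + b                     # fake minibatch G(z)
    # inner step: ascend E[log D(x)] + E[log(1 - D(G(z)))] in (w, c)
    d_real = sigmoid(w * x + c)
    d_fake = sigmoid(w * g + c)
    w += lr * (np.mean((1 - d_real) * x) - np.mean(d_fake * g))
    c += lr * (np.mean(1 - d_real) - np.mean(d_fake))
    # outer step: descend E[log(1 - D(G(z)))] in (a, b);
    # d/dG log(1 - sigmoid(w*G + c)) = -sigmoid(w*G + c) * w
    d_fake = sigmoid(w * (a * z + b) + c)
    a -= lr * np.mean(-d_fake * w * z)
    b -= lr * np.mean(-d_fake * w)

print(a, b, w, c)  # b should drift toward the real mean 2.0
```

Even in this toy, one can see the delicacy of the balance: the generator's gradient is scaled by sigmoid outputs, so if the discriminator is trained too far ahead, those outputs saturate and the generator's updates shrink.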
Mode collapse. The alternating SGD is attracted to stationary points and therefore is not good at distinguishing between a min_{θ_G} max_{θ_D} problem and a max_{θ_D} min_{θ_G} problem. When the solution to the latter is returned, the generator tends to always produce points at the modes that best deceive the discriminator, making P_gen of low diversity.[2]

Vanishing gradients. At the beginning of a training process, the finite real and fake training data may not overlap in the data space, and thus the discriminator may be able to perfectly separate the real from the fake data. Given the cross-entropy loss (or, more generally, any f-divergence measure (Rényi, 1961) between P_data and P_gen), the value of the discriminator becomes saturated on both sides of the decision boundary, resulting in zero gradients for θ_G.

In this paper, we argue that the above issues are rooted in the modeling of D. In most existing variants of GANs, the discriminator is a deep neural network with explicit weights θ_D. Under gradient descent, the gradients of θ_G in Eq. (1) cannot be back-propagated through the inner max_{θ_D} problem, because doing so would require computing high-order derivatives of θ_D. This motivates the use of alternating SGD, which in turn causes the convergence issues and mode collapse. Furthermore, D is a single network whose particularity may cause a catastrophic effect, such as vanishing gradients, during training. We instead model the discriminator D as a Gaussian process whose mean and covariance are governed by a kernel function called the neural tangent kernel (NTK-GP) (Jacot et al., 2018; Lee et al., 2019; Chizat et al., 2019). This D approximates an infinite ensemble of infinitely wide neural networks in a nonparametric manner and has no explicit weights. In particular, its training dynamics can be completely described by a closed-form formula.
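For intuition on the closed-form dynamics, recall the standard result for infinitely wide networks (Jacot et al., 2018; Lee et al., 2019): under squared loss and full-batch gradient flow with learning rate η, the mean prediction of the linearized network at training time t is given in closed form by the NTK. Writing Θ for the NTK, X for the training inputs, and Y for their labels (symbols here chosen for illustration):

```latex
\mu_t(x) \;=\; \Theta(x, X)\,\Theta(X, X)^{-1}
\left( I - e^{-\eta\,\Theta(X, X)\,t} \right) Y
```

No inner optimization loop over discriminator weights is needed: the trained discriminator at any time t is a deterministic, differentiable function of the training data.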
This allows us to simplify adversarial data synthesis into a single-level optimization problem, which we call the generative adversarial NTK (GA-NTK). Moreover, since D is an infinite ensemble of networks, the particularity of a single element network does not drastically change the training process. This makes GA-NTK less prone to vanishing gradients and stabilizes training even when an f-divergence measure between P_data and P_gen is used as the loss of D. The following summarizes our contributions:

• We propose a single-level optimization method, named GA-NTK, for adversarial data synthesis. It can be solved by ordinary gradient descent, avoiding the difficulties of bilevel optimization in GANs.

• We prove the convergence of GA-NTK training under mild conditions. We also show that D being an infinite ensemble of networks can provide smooth gradients for G, which stabilizes GA-NTK training and helps fight vanishing gradients.

• We propose practical techniques to reduce the memory consumption of GA-NTK during training and to improve the quality of images synthesized by GA-NTK.

• We conduct extensive experiments on real-world datasets to study the advantages and limitations of GA-NTK. In particular, we find that GA-NTK has much lower sample complexity than GANs, and that the presence of a generator is not necessary to generate images under the adversarial setting.

Note that the goal of this paper is not to replace existing GANs nor to advance the state-of-the-art performance, but to show that adversarial data synthesis can be done via single-level modeling. Our work has implications for future research. In particular, the low sample complexity makes GA-NTK suitable for applications, such as medical imaging, where data are personalized or not easily collectible. In addition, GA-NTK bridges the gap between kernel methods and adversarial data/image synthesis and thus enables future studies on the relationship between kernels and generated data.
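To illustrate the single-level idea in the simplest possible terms, here is a small 1-D sketch in the spirit of GA-NTK: the synthesized points themselves are optimized by ordinary gradient descent against a kernel discriminator trained in closed form on real-vs-fake labels. An RBF kernel stands in for the NTK, central finite differences stand in for analytic gradients, and all constants are illustrative; this is not the paper's actual objective or implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D "dataset": real points cluster near 2.0; the points to be
# synthesized start near 0.0 and are optimized directly (no generator).
x_real = rng.normal(2.0, 0.1, size=(8, 1))
x_fake = rng.normal(0.0, 0.1, size=(4, 1))

def rbf(A, B, ls=1.0):
    """RBF kernel matrix (a stand-in for the NTK in this sketch)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ls**2))

def ga_loss(x_f, t=0.2, reg=1e-6):
    """Single-level objective: train a kernel discriminator on real
    (label 1) vs. fake (label 0) points in closed form for 'time' t,
    then penalize fake points the discriminator does not score as real."""
    X = np.vstack([x_real, x_f])
    y = np.concatenate([np.ones(len(x_real)), np.zeros(len(x_f))])
    K = rbf(X, X) + reg * np.eye(len(X))
    w, V = np.linalg.eigh(K)                         # K = V diag(w) V^T
    f_t = V @ ((1.0 - np.exp(-t * w)) * (V.T @ y))   # f_t = (I - e^{-tK}) y
    return ((1.0 - f_t[len(x_real):]) ** 2).mean()

# Plain gradient descent on the fake points, with central finite
# differences in place of analytic gradients (for brevity only).
init_loss = ga_loss(x_fake)
lr, eps = 0.1, 1e-4
for _ in range(400):
    g = np.zeros_like(x_fake)
    for i in range(x_fake.shape[0]):
        xp, xm = x_fake.copy(), x_fake.copy()
        xp[i, 0] += eps
        xm[i, 0] -= eps
        g[i, 0] = (ga_loss(xp) - ga_loss(xm)) / (2 * eps)
    x_fake = x_fake - lr * g

print(init_loss, ga_loss(x_fake), x_fake.mean())
```

The whole procedure is one loop of plain gradient descent on one objective; there is no alternating inner/outer game, which is the structural difference from Eq. (1). The early-stopping time t here plays the role of the discriminator's training budget.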

2. RELATED WORK

2.1 GANS AND IMPROVEMENTS

Goodfellow et al. (2014) propose GANs and give a theoretical convergence guarantee in the function space. However, in practice, one can only optimize the generator and discriminator in Eq. (1) in the parameter/weight space. Many techniques have been proposed to make the bilevel optimization easier.

Failure to converge. To solve this problem, studies devise new training algorithms for GANs (Nagarajan & Kolter, 2017; Daskalakis et al., 2018) or for more general minimax problems (Thekumparampil et al., 2019; Mokhtari et al., 2020). But recent works by Mescheder et al. (2018); Farnia & Ozdaglar (2020) show that there may not be a Nash equilibrium solution in GANs. Mode



[1] Our code is available on GitHub at https://github.com/ga-ntk/ga-ntk.

[2] Mode collapse can be caused by other reasons, such as the structure of G. This paper only solves the problem due to alternating SGD.

