SINGLE-LEVEL ADVERSARIAL DATA SYNTHESIS BASED ON NEURAL TANGENT KERNELS

Anonymous

Abstract

Generative adversarial networks (GANs) have achieved impressive performance in data synthesis and have driven the development of many applications. However, GANs are known to be hard to train due to their bilevel objective, which leads to problems with convergence, mode collapse, and vanishing gradients. In this paper, we propose a new generative model, called the generative adversarial NTK (GA-NTK), that has a single-level objective. GA-NTK keeps the spirit of adversarial learning (which helps generate plausible data) while avoiding the training difficulties of GANs. This is done by modeling the discriminator as a Gaussian process with a neural tangent kernel (NTK-GP) whose training dynamics can be completely described by a closed-form formula. We analyze the convergence behavior of GA-NTK trained by gradient descent and give sufficient conditions for convergence. We also conduct extensive experiments to study the advantages and limitations of GA-NTK and propose techniques that make GA-NTK more practical.

1. INTRODUCTION

Generative adversarial networks (GANs) (Goodfellow et al., 2014; Radford et al., 2016), a branch of deep generative models based on adversarial learning, have received much attention due to their novel problem formulation and impressive performance in data synthesis. Variants of GANs have also driven recent developments in many applications, such as super-resolution (Ledig et al., 2017), image inpainting (Xu et al., 2014), and video generation (Vondrick et al., 2016). A GAN framework consists of a discriminator network D and a generator network G parametrized by θ_D and θ_G, respectively. Given a d-dimensional data distribution P_data and a c-dimensional noise distribution P_noise, the generator G maps a random noise z ∈ R^c to a point G(z) ∈ R^d in the data space, while the discriminator D takes a point x ∈ R^d as input and tells whether x is real or fake, i.e., D(x) = 1 if x ∼ P_data and D(x) = 0 if x ∼ P_gen, where P_gen is the distribution of G(z) with z ∼ P_noise. The objective of GANs is typically formulated as a bilevel optimization problem:

$$\arg\min_{\theta_G}\max_{\theta_D}\; \mathbb{E}_{x\sim P_{\mathrm{data}}}[\log D(x)] + \mathbb{E}_{z\sim P_{\mathrm{noise}}}[\log(1 - D(G(z)))]. \tag{1}$$

The discriminator D and generator G aim to break each other through the inner max and outer min objectives, respectively. The studies by Goodfellow et al. (2014) and Radford et al. (2016) show that this adversarial formulation can lead to a better generator that produces plausible data points/images. However, GANs are known to be hard to train due to the following issues (Goodfellow, 2016).

Failure to converge. In practice, Eq. (1) is usually only approximately solved by an alternating first-order method such as alternating stochastic gradient descent (SGD). The alternating updates for θ_D and θ_G may cancel each other's progress. During each alternating training step, it is also tricky to balance the number of SGD updates for θ_D against that for θ_G, as too few or too many updates for θ_D lead to low-quality gradients for θ_G.
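The cancellation effect of alternating updates can be seen on the classic bilinear toy problem f(x, y) = x·y, a minimal stand-in for the GAN game (the variables, step size, and iteration count below are illustrative, not from the paper). The unique equilibrium is (0, 0), yet alternating gradient steps orbit around it instead of converging:

```python
import numpy as np

# Toy min-max game f(x, y) = x * y: the "generator" minimizes over x while
# the "discriminator" maximizes over y. Alternating first-order updates
# cycle around the equilibrium (0, 0) rather than converging to it.
def alternating_gd(x, y, lr=0.1, steps=2000):
    traj = []
    for _ in range(steps):
        x = x - lr * y   # descent step for the minimizing player (d/dx xy = y)
        y = y + lr * x   # ascent step for the maximizing player (d/dy xy = x)
        traj.append((x, y))
    return np.array(traj)

traj = alternating_gd(1.0, 1.0)
radii = np.linalg.norm(traj, axis=1)  # distance of each iterate to (0, 0)
```

Gradient descent on an ordinary (single-level) smooth objective would drive the iterates to a stationary point; here the radii stay bounded away from zero forever, which is the convergence failure described above.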
Mode collapse. The alternating SGD is attracted by stationary points and therefore is not good at distinguishing between a min_{θ_G} max_{θ_D} problem and a max_{θ_D} min_{θ_G} problem. When the solution to the latter is returned, the generator tends to always produce points at the modes that best deceive the discriminator, making P_gen of low diversity.

Vanishing gradients. At the beginning of a training process, the finite real and fake training data may not overlap with each other in the data space, and thus the discriminator may be able to perfectly separate the real from the fake data. Given the cross-entropy loss (or, more generally, any f-divergence measure (Rényi, 1961) between P_data and P_gen), the value of the discriminator becomes saturated on both sides of the decision boundary, resulting in zero gradients for θ_G.

In this paper, we argue that the above issues are rooted in the modeling of D. In most existing variants of GANs, the discriminator is a deep neural network with explicit weights θ_D. Under gradient descent, the gradients of θ_G in Eq. (1) cannot be back-propagated through the inner max_{θ_D} problem, because doing so would require computing high-order derivatives of θ_D. This motivates the use of alternating SGD, which in turn causes the convergence issues and mode collapse. Furthermore, D is a single network whose particularity may cause a catastrophic effect, such as vanishing gradients, during training. We instead model the discriminator D as a Gaussian process whose mean and covariance are governed by a kernel function called the neural tangent kernel (NTK-GP) (Jacot et al., 2018; Lee et al., 2019; Chizat et al., 2019). This D approximates an infinite ensemble of infinitely wide neural networks in a nonparametric manner and has no explicit weights. In particular, its training dynamics can be completely described by a closed-form formula.
This allows us to simplify adversarial data synthesis into a single-level optimization problem, which we call the generative adversarial NTK (GA-NTK). Moreover, since D is an infinite ensemble of networks, the particularity of a single element network does not drastically change the training process. This makes GA-NTK less prone to vanishing gradients and stabilizes training even when an f-divergence measure between P_data and P_gen is used as the loss of D. The following summarizes our contributions:

• We propose a single-level optimization method, named GA-NTK, for adversarial data synthesis. It can be solved by ordinary gradient descent, avoiding the difficulties of bilevel optimization in GANs.
• We prove the convergence of GA-NTK training under mild conditions. We also show that D being an infinite ensemble of networks can provide smooth gradients for G, which stabilizes GA-NTK training and helps fight vanishing gradients.
• We propose practical techniques to reduce the memory consumption of GA-NTK during training and to improve the quality of images synthesized by GA-NTK.
• We conduct extensive experiments on real-world datasets to study the advantages and limitations of GA-NTK. In particular, we find that GA-NTK has much lower sample complexity than GANs, and that a generator is not necessary to generate images under the adversarial setting.

Note that the goal of this paper is not to replace existing GANs nor to advance the state-of-the-art performance, but to show that adversarial data synthesis can be done via single-level modeling. Our work has implications for future research. In particular, the low sample complexity makes GA-NTK suitable for applications, such as medical imaging, where data are personalized or not easily collectible. In addition, GA-NTK bridges the gap between kernel methods and adversarial data/image synthesis and thus enables future studies on the relationship between kernels and generated data.

2. RELATED WORK

2.1 GANS AND IMPROVEMENTS

Goodfellow et al. (2014) propose GANs and give a theoretical convergence guarantee in the function space. However, in practice, one can only optimize the generator and discriminator in Eq. (1) in the parameter/weight space. Many techniques have been proposed to make the bilevel optimization easier.

Failure to converge. To solve this problem, studies devise new training algorithms for GANs (Nagarajan & Kolter, 2017; Daskalakis et al., 2018) or for more general minimax problems (Thekumparampil et al., 2019; Mokhtari et al., 2020). However, recent works by Mescheder et al. (2018) and Farnia & Ozdaglar (2020) show that a Nash equilibrium solution may not exist in GANs.

Mode collapse. Metz et al. (2017) alleviate this issue by back-propagating the computation of θ_G through discriminators trained with several steps, strengthening the min_{θ_G} max_{θ_D} property. Other works mitigate mode collapse by diversifying the modes of D through regularization (Che et al., 2017; Mao et al., 2019), modeling D as an ensemble of multiple neural networks (Durugkar et al., 2017; Ghosh et al., 2018), or using additional auxiliary networks (Srivastava et al., 2017; Bang & Shim, 2021; Li et al., 2021).

Vanishing gradients. Mao et al. (2017) try to solve this problem by using the Pearson χ²-divergence between P_data and P_gen as the loss to penalize data points that are far away from the decision boundary. However, it still suffers from vanishing gradients, as any f-divergence measure, including the cross-entropy loss and the Pearson χ²-divergence, cannot measure the difference between disjoint distributions (Sajjadi et al., 2018). Later studies replace the loss with either the Wasserstein distance (Arjovsky et al., 2017; Gulrajani et al., 2017) or the maximum mean discrepancy (Gretton et al., 2012; Li et al., 2015; 2017), both of which can measure the divergence of disjoint P_data and P_gen. In addition, the works by Miyato et al.
(2018) and Qi (2020) aim to constrain the Lipschitz continuity of the discriminator to prevent its value from saturating. Although many efforts have been made to improve the training of GANs, most existing approaches address only one or two issues at a time under different assumptions, and in the meantime they introduce new hyperparameters or side effects. For example, in the Wasserstein GANs (Arjovsky et al., 2017; Gulrajani et al., 2017) mentioned above, efficient computation of the Wasserstein distance requires the discriminator to be Lipschitz continuous. However, realizing Lipschitz continuity introduces new hyperparameters and can limit the expressiveness of the discriminator (Anil et al., 2019). To date, training GANs is still not an easy task, because one has to 1) tune many hyperparameters and 2) strike a balance between the benefits and costs of different training techniques to generate satisfactory data points/images.

2.2. GAUSSIAN PROCESSES AND NEURAL TANGENT KERNELS

Consider an infinite ensemble of infinitely wide networks that use the mean squared error (MSE) as the loss and are trained by gradient descent. Recent developments in deep learning theory show that the prediction of the ensemble can be approximated by a special instance of Gaussian process called the NTK-GP (Jacot et al., 2018; Lee et al., 2019; Chizat et al., 2019). The NTK-GP is a Bayesian method, so it outputs a distribution of possible values for an input point. The mean and covariance of the NTK-GP prediction are governed by a kernel function k(·,·) called the neural tangent kernel (NTK). Given two data points x_i and x_j, k(x_i, x_j) represents the similarity score of the two points in a kernel space, which is fixed once the hyperparameters of the initial weights, the activation function, and the architecture of the networks in the target ensemble are determined. Here, we focus on the mean prediction of the NTK-GP as it is relevant to our study. Consider a supervised learning task with a training set D^n = (X^n ∈ R^{n×d}, Y^n ∈ R^{n×c}), where there are n examples and each example consists of a pair of d-dimensional input and c-dimensional output. Let K^{n,n} ∈ R^{n×n} be the kernel matrix for X^n, i.e., K^{n,n}_{i,j} = k(X^n_{i,:}, X^n_{j,:}). Then, at time step t of gradient descent, the mean prediction of the NTK-GP for X^n evolves as

$$(I_n - e^{-\eta K^{n,n} t})\, Y^n \in \mathbb{R}^{n\times c}, \tag{2}$$

where I_n ∈ R^{n×n} is an identity matrix and η is a sufficiently small learning rate (Jacot et al., 2018; Lee et al., 2019). The NTK in Eq. (2) can be extended to support different network architectures, including convolutional neural networks (CNNs) (Arora et al., 2019; Novak et al., 2019b), recurrent neural networks (RNNs) (Alemohammad et al., 2021; Yang, 2019b), networks with the attention mechanism (Hron et al., 2020), and other architectures (Yang, 2019b; Arora et al., 2019).
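Since K is symmetric positive semi-definite, the matrix exponential in Eq. (2) can be evaluated with a single eigendecomposition. A minimal NumPy sketch, using a small hypothetical PSD kernel matrix in place of an actual NTK:

```python
import numpy as np

def ntk_gp_mean(K, Y, eta, t):
    """Closed-form NTK-GP mean prediction at time t: (I - e^{-eta*K*t}) Y, Eq. (2)."""
    w, V = np.linalg.eigh(K)                  # K is symmetric PSD
    decay = (V * np.exp(-eta * w * t)) @ V.T  # e^{-eta*K*t} via eigendecomposition
    return (np.eye(len(K)) - decay) @ Y

# tiny hypothetical example: any positive-definite kernel matrix works here
A = np.array([[1.0, 0.2], [0.2, 1.0], [0.3, 0.9]])
K = A @ A.T + 1e-2 * np.eye(3)    # positive definite by construction
Y = np.array([[1.0], [0.0], [1.0]])
```

At t = 0 the prediction is identically zero; as t grows, each eigendirection of K relaxes toward the labels at a rate set by its eigenvalue, so the prediction interpolates Y in the limit.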
Furthermore, studies (Novak et al., 2019a; Lee et al., 2020; Arora et al., 2020; Geifman et al., 2020) show that NTK-GPs perform similarly to their finite-width counterparts (neural networks) in many situations, and sometimes even better on small-data tasks. A recent study by Franceschi et al. (2021) analyzes the behavior of GANs from the NTK perspective by taking the alternating optimization into account. It shows that, in theory, the discriminator can provide a well-defined gradient flow for the generator, which contradicts previous theoretical interpretations (Arjovsky & Bottou, 2017). Our work, on the other hand, focuses on adversarial data synthesis without alternating optimization. We make contributions in this direction by 1) formally proving the convergence of the proposed single-level optimization, 2) showing that a generator network is not necessary for generating plausible images (although it might be desirable), and 3) proposing the batch-wise and multi-resolutional extensions that respectively improve the memory efficiency of training and the global coherency of generated image patterns.

3. GA-NTK

We present a new adversarial data synthesis method, called the generative adversarial NTK (GA-NTK), based on the NTK theory (Jacot et al., 2018; Lee et al., 2019; Chizat et al., 2019). For simplicity of presentation, we let G(z) = z ∈ R^d and focus on the discriminator for now; we discuss the case where G(·) is a generator network in Section 3.2. Given an unlabeled, d-dimensional dataset X^n ∈ R^{n×d} of n points, we first augment X^n to obtain a labeled training set D^{2n} = (X^n ⊕ Z^n ∈ R^{2n×d}, 1^n ⊕ 0^n ∈ R^{2n}), where Z^n ∈ R^{n×d} contains n generated points, 1^n ∈ R^n and 0^n ∈ R^n are label vectors of ones and zeros, respectively, and ⊕ is the vertical stack operator. Then, we model a discriminator trained on D^{2n} as an NTK-GP. Let K^{2n,2n} ∈ R^{2n×2n} be the kernel matrix for X^n ⊕ Z^n, where the value of each element K^{2n,2n}_{i,j} = k((X^n ⊕ Z^n)_{i,:}, (X^n ⊕ Z^n)_{j,:}) can be computed once we decide the initialization, activation function, and architecture of the element networks in the target infinite ensemble, i.e., the discriminator. By Eq. (2), and letting λ = η·t, the mean predictions of the discriminator can be written as

$$D(X^n, Z^n; k, \lambda) = (I_{2n} - e^{-\lambda K^{2n,2n}})(\mathbf{1}^n \oplus \mathbf{0}^n) \in \mathbb{R}^{2n}, \tag{3}$$

where I_{2n} ∈ R^{2n×2n} is an identity matrix. We formulate the objective of GA-NTK as follows:

$$\arg\min_{Z^n} L(Z^n), \quad \text{where } L(Z^n) = \|\mathbf{1}^{2n} - D(X^n, Z^n; k, \lambda)\|. \tag{4}$$

L(·) is the loss function and 1^{2n} ∈ R^{2n} is a vector of ones. Statistically, Eq. (4) aims to minimize the Pearson χ²-divergence (Jeffreys, 1946), a case of f-divergence, between P_data + P_gen and 2P_gen, where P_gen is the distribution of generated points; see Section 6 in the Appendix for more details. GA-NTK formulates an adversarial data synthesis task as a single-level optimization problem. On the one hand, GA-NTK aims to find points Z^n that best deceive the discriminator so that it outputs the wrong labels 1^{2n} for these points.
On the other hand, the discriminator is trained on D^{2n} with the correct labels 1^n ⊕ 0^n and therefore has the opposite goal of distinguishing between the real and generated points. Such an adversarial setting can be made single-level because the training dynamics of the discriminator D under gradient descent are completely described by the closed-form formula in Eq. (3): any change of Z^n causes D to be "retrained" instantly. Therefore, one can easily solve Eq. (4) by ordinary SGD.

Training. Before running SGD, one needs to tune the hyperparameter λ. We show in the next section that the value of λ should be large enough but finite. Therefore, the complete training process of GA-NTK is to 1) find the minimal λ that allows the discriminator to separate real data from pure noise in an auxiliary task, and 2) solve for Z^n in Eq. (4) by ordinary SGD with the fixed λ. Please see Section 7.3 in the Appendix for more details.
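To make Eqs. (3)-(4) concrete, here is a self-contained toy sketch in NumPy. An RBF kernel stands in for the NTK (the structure of the algorithm is unchanged), the gradient with respect to Z is taken by finite differences rather than automatic differentiation, and all sizes and hyperparameters are illustrative:

```python
import numpy as np

def kernel(A, B, gamma=0.5):
    # stand-in PSD (RBF) kernel; the paper would use an NTK here
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def discriminator(X, Z, lam):
    """Mean prediction of the NTK-GP trained on (X + Z, 1 + 0), as in Eq. (3)."""
    n = len(X)
    P = np.vstack([X, Z])
    y = np.concatenate([np.ones(n), np.zeros(n)])
    w, V = np.linalg.eigh(kernel(P, P))
    return y - V @ (np.exp(-lam * w) * (V.T @ y))   # (I - e^{-lam*K}) y

def loss(X, Z, lam):
    return np.linalg.norm(1.0 - discriminator(X, Z, lam))   # Eq. (4)

def synthesize(X, lam=3.0, steps=300, lr=0.02, eps=1e-4, seed=0):
    Z = np.random.default_rng(seed).normal(1.5, 0.3, size=X.shape)
    history = [loss(X, Z, lam)]
    for _ in range(steps):
        g = np.zeros_like(Z)
        for i in range(Z.size):   # finite-difference gradient (sketch only)
            d = np.zeros_like(Z); d.flat[i] = eps
            g.flat[i] = (loss(X, Z + d, lam) - loss(X, Z - d, lam)) / (2 * eps)
        Z -= lr * g               # each step implicitly "retrains" D via Eq. (3)
        history.append(loss(X, Z, lam))
    return Z, history

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy "real" data
Z, history = synthesize(X)
```

Note there is no inner training loop for D anywhere: the discriminator is just a formula of Z, so every update of Z faces a fully retrained discriminator, which is exactly the single-level property described above.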

3.1. MERITS

As compared to GANs, GA-NTK offers the following advantages:

Convergence. GA-NTK can be trained by ordinary gradient descent. This gives much nicer convergence properties:

Theorem 3.1 Let s be the number of gradient descent iterations for solving Eq. (4), and let Z^{n,(s)} be the solution at the s-th iteration. Suppose the following values are bounded: (a) X^n_{i,j} and Z^{n,(0)}_{i,j}, ∀i,j, (b) t and η, and (c) σ and L. Also, assume that (d) X^n contains finite, non-identical, normalized rows. Then, for a sufficiently large t, we have

$$\min_{j\le s} \|\nabla_{Z^n} L(Z^{n,(j)})\|^2 \le O\!\left(\tfrac{1}{s-1}\right).$$

We prove the above theorem by showing that, with a large enough λ, ∇_{Z^n} L(Z^{n,(s)}) is smooth enough to lead to the convergence of gradient descent. For more details, please see Section 7 in the Appendix.

Diversity. GA-NTK avoids the mode collapse caused by the confusion between the min-max and max-min problems in alternating SGD. Given different initial values, the generated points in Z^n can be very different from each other.

No vanishing gradients, no side effects. The hyperparameter λ controls how much D learns from the real and fake data during each iteration. Figure 5 shows the gradients of D with a finite λ, which do not saturate. This avoids the necessity of using a loss that imposes side effects, such as the Wasserstein distance (Arjovsky et al., 2017; Gulrajani et al., 2017), whose efficient evaluation requires Lipschitz continuity of D.

3.2. GA-NTK IN PRACTICE

Scalability. To generate a large number of points, we can solve multiple Z^n's in Eq. (4) in parallel on different machines. On a single machine, however, the gradients of Z^n need to be back-propagated through the computation of K^{2n,2n}, which consumes a significant amount of memory. Although recent attempts have been made to reduce the time and space complexity of evaluating the NTK and its variants, they are still at an early stage of development, and the space consumed in practice may still be too large. To alleviate this problem, we propose the batch-wise GA-NTK with the objective

$$\arg\min_{Z^n}\; \mathbb{E}_{X^{b/2}\subset X^n,\, Z^{b/2}\subset Z^n}\, \|\mathbf{1}^{b} - D(X^{b/2}, Z^{b/2}; k, \lambda)\|, \tag{5}$$

which can be solved using mini-batches: during each gradient descent iteration, we 1) randomly sample a batch of b rows of X^n ⊕ Z^n and their corresponding labels, and 2) update Z^n based on K^{b,b}. Although the batch-wise GA-NTK is cosmetically similar to the original GA-NTK, it solves a different problem. In the original GA-NTK, Z^n aims to fool a single discriminator D trained on 2n examples, while in the batch-wise GA-NTK, Z^n's goal is to deceive many discriminators, each trained on only b examples. Fortunately, Shankar et al. (2020) and Arora et al. (2020) have shown that NTK-based methods perform well on small datasets. We verify this with experiments later.

Generator network. So far, we have let G(z) = z and shown that a generator is not necessary in adversarial data synthesis. Nevertheless, the presence of a generator network may be favorable in some applications to save time and memory at inference time. This can be done by extending the batch-wise GA-NTK as follows:

$$\arg\min_{\theta_G}\; \mathbb{E}_{X^{b/2}\subset X^n,\, Z^{b/2}\sim N(0, I)}\, \|\mathbf{1}^{b} - D(X^{b/2}, G(Z^{b/2}; \theta_G); k, \lambda)\|, \tag{6}$$

where G(·; θ_G) is a generator network parametrized by θ_G and z ∈ R^l with l ≤ d. Note that this is still a single-level objective, and θ_G can be solved for by gradient descent. We denote this variant GA-NTKg.

Image quality.
To generate images, one can pair GA-NTK with a convolutional neural tangent kernel (CNTK) (Arora et al., 2019; Novak et al., 2019b; Garriga-Alonso et al., 2019; Yang, 2019a), which approximates a CNN with infinitely many channels. This allows the NTK-GP (discriminator) to distinguish between real and fake points based on local patterns in the pixel space. However, the images synthesized by this GA-NTK variant may lack global coherency, just like the images generated by CNN-based GANs (Radford et al., 2016; Salimans et al., 2016). Many efforts have been made to improve the image quality of CNN-based GANs, and this paper opens up opportunities for them to be adapted to the kernel regime. In particular, we propose the multi-resolutional GA-CNTK based on the work by Wang et al. (2018), whose objective is formulated as

$$\arg\min_{Z^n}\; \sum_m \|\mathbf{1}^{2n} - D_m(\mathrm{pool}_m(X^n), \mathrm{pool}_m(Z^n); k_m, \lambda_m)\|, \tag{7}$$

where D_m is an NTK-GP taking input at a particular pixel resolution and pool_m(·) is a downsampling operation (average pooling) applied to each row of X^n and Z^n. The generated points in Z^n aim to simultaneously fool multiple NTK-GPs (discriminators), each classifying real and fake images at a distinct pixel resolution. The NTK-GPs working at low and high resolutions encourage global coherency and fine details, respectively, and together they lead to more plausible points in Z^n.
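The batch-wise objective of Eq. (5) can be sketched in the same toy setup as before (a stand-in RBF kernel, finite-difference gradients, illustrative sizes): each step samples b/2 real and b/2 generated rows and fools a fresh, batch-sized discriminator, so only a b×b kernel matrix is ever materialized.

```python
import numpy as np

rng = np.random.default_rng(1)

def kernel(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)   # stand-in PSD kernel (an NTK in the paper)

def batch_loss(Xb, Zb, lam):
    m = len(Xb)
    P = np.vstack([Xb, Zb])
    y = np.concatenate([np.ones(m), np.zeros(m)])
    w, V = np.linalg.eigh(kernel(P, P))
    D = y - V @ (np.exp(-lam * w) * (V.T @ y))   # batch-sized discriminator, Eq. (3)
    return np.linalg.norm(1.0 - D)               # one term of Eq. (5)

X = rng.normal(0.0, 1.0, size=(32, 2))   # toy "real" data
Z = rng.normal(1.0, 0.5, size=(32, 2))   # generated points, updated in place
Z0 = Z.copy()
b, lam, lr, eps = 8, 3.0, 0.05, 1e-4
for _ in range(200):
    xi = rng.choice(len(X), b // 2, replace=False)   # sample b/2 real rows
    zi = rng.choice(len(Z), b // 2, replace=False)   # sample b/2 generated rows
    g = np.zeros((b // 2, 2))
    for i in range(g.size):   # finite-difference gradient (sketch only)
        d = np.zeros_like(g); d.flat[i] = eps
        g.flat[i] = (batch_loss(X[xi], Z[zi] + d, lam)
                     - batch_loss(X[xi], Z[zi] - d, lam)) / (2 * eps)
    Z[zi] -= lr * g   # update only the sampled rows of Z
```

The memory saving is the point of the variant: the eigendecomposition cost drops from O((2n)³) per step to O(b³), at the price of fooling many small discriminators instead of one large one.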

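The pool_m operator in the multi-resolutional objective is plain non-overlapping average pooling, and the outer sum just accumulates one discriminator loss per resolution. A sketch, with the per-resolution discriminator loss abstracted as a callable (stubbed in the test with a mean-squared difference) and the pooling factors chosen for illustration:

```python
import numpy as np

def pool(img, f):
    """Non-overlapping f x f average pooling: the pool_m in Eq. (7)."""
    h, w = img.shape
    return img.reshape(h // f, f, w // f, f).mean(axis=(1, 3))

def multires_loss(X, Z, disc_loss, factors=(1, 4, 16)):
    """Sum of per-resolution discriminator losses, as in Eq. (7)."""
    return sum(disc_loss(np.stack([pool(x, f) for x in X]),
                         np.stack([pool(z, f) for z in Z]))
               for f in factors)
```

Each factor f yields a coarser view of both the real and generated images; a discriminator at a large f only sees global structure, which is what pushes the generated points toward globally coherent images.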
4. EXPERIMENTS

We conduct experiments to study how GA-NTK performs in image generation.

Datasets. We consider unsupervised/unconditional image synthesis tasks over real-world datasets, including MNIST (LeCun et al., 2010), CIFAR-10 (Krizhevsky, 2009), CelebA (Liu et al., 2015), CelebA-HQ (Liu et al., 2015), and ImageNet (Deng et al., 2009). To improve training efficiency, we resize CelebA images to 64×64 and ImageNet images to 128×128 pixels. We also create a 2D toy dataset consisting of a 25-modal Gaussian mixture of points to visualize the behavior of different image synthesis methods.

GA-NTK implementations. GA-NTK works with different NTK-GPs. For the image synthesis tasks, we consider NTK-GPs that model ensembles of fully-connected networks (Jacot et al., 2018; Lee et al., 2019; Chizat et al., 2019) and convolutional networks (Arora et al., 2019; Novak et al., 2019b; Garriga-Alonso et al., 2019; Yang, 2019a), respectively. We implement GA-NTK using the Neural Tangents library (Novak et al., 2019a) and call the variants based on the former and latter NTK-GPs GA-FNTK and GA-CNTK, respectively. In GA-FNTK, an element network of the discriminator has 3 infinitely wide, fully-connected layers with ReLU non-linearity, while in GA-CNTK, an element network follows the architecture of InfoGAN (Chen et al., 2016) except for having infinitely many filters at each layer. We tune the hyperparameters of GA-FNTK and GA-CNTK following the methods proposed by Poole et al. (2016), Schoenholz et al. (2017), and Raghu et al. (2017). We also implement the batch-wise, generator, and multi-resolutional variants described in Section 3.2. See Section 7 in the Appendix for more details.

Baselines.
We compare GA-NTK with popular variants of GANs, including vanilla GANs (Goodfellow et al., 2014), DCGAN (Radford et al., 2016), LSGAN (Mao et al., 2017), WGAN (Arjovsky et al., 2017), WGAN-GP (Gulrajani et al., 2017), SNGAN (Miyato et al., 2018), and StyleGAN2 (Karras et al., 2020). For a fair comparison, we let the discriminator of each baseline follow the architecture of InfoGAN (Chen et al., 2016) and tune the hyperparameters using grid search.

Metrics. We evaluate the quality of a set of generated images using the Fréchet Inception Distance (FID) (Heusel et al., 2017); the lower the FID score, the better. We find that an image synthesis method may produce degenerate images that look almost identical to some images in the training set. Therefore, we also use a metric called the average max-SSIM (AM-SSIM), which averages the maximum SSIM score (Wang et al., 2004) between P_gen and P_data:

$$\mathrm{AM\text{-}SSIM}(P_{\mathrm{gen}}, P_{\mathrm{data}}) = \mathbb{E}_{x'\sim P_{\mathrm{gen}}}\!\left[\max_{x\sim P_{\mathrm{data}}} \mathrm{SSIM}(x', x)\right].$$

A generated image set has a higher AM-SSIM score if it contains such degenerate images.

Environment and limitations. We conduct all experiments on a cluster of machines with 80 NVIDIA Tesla V100 GPUs. As discussed in Section 3.2, GA-NTK consumes a significant amount of memory on each machine due to the computations involving the kernel matrix K^{2n,2n}. With the current version of the Neural Tangents library (Novak et al., 2019a) and a V100 GPU with 32 GB RAM, the maximum training set sizes for MNIST, CIFAR-10, CelebA, and ImageNet are 1024, 512, 256, and 128, respectively (the computation graph and backprop operations for K^{2n,2n} consume about 27.5 GB RAM, excluding other necessary operations).
Since our goal is not to achieve state-of-the-art performance but to compare different image synthesis methods, we train all the methods using up to 256 images randomly sampled from all classes of MNIST, the "horse" class of CIFAR-10, the "male with straight hair" class of CelebA, and the "daisy" class of ImageNet, respectively. We conduct larger-scale experiments in Section 9.2. For more details about our experiment settings, please see Section 7 in the Appendix.
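The AM-SSIM metric can be sketched directly from its definition. The sketch below uses a simplified global (single-window) SSIM instead of the usual sliding-window implementation, and assumes image values in [0, 1]; the constants c1 and c2 are the standard SSIM stabilizers.

```python
import numpy as np

def ssim_global(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified SSIM computed over the whole image (no sliding window)."""
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def am_ssim(gen, data):
    """AM-SSIM: mean over generated images of the max SSIM against the training set."""
    return float(np.mean([max(ssim_global(g, x) for x in data) for g in gen]))
```

A memorized copy of any training image contributes a max-SSIM of exactly 1, so a set containing such copies drives AM-SSIM upward; this is the degenerate-image behavior the metric is designed to expose.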

4.1. IMAGE QUALITY

We first study the quality of the images synthesized by different methods. Table 1 summarizes the results. In particular, WGAN-GP performs the best among the GAN variants. However, WGAN-GP limits the Lipschitz continuity of the discriminator and gives higher FID scores than GA-CNTK. It also gives higher AM-SSIM values as the size of the training set decreases, implying that many degenerate images look identical to some training images. This is because the Wasserstein distance, also called the earth mover's distance, allows fewer ways of moving when there is less available "earth" (i.e., the density values of P_data and P_gen) due to a small n, and thus P_gen needs to be exactly the same as P_data to minimize the distance (this problem may be alleviated when n becomes larger). The GA-NTK variants, including GA-CNTK and GA-CNTKg ("g" means "with generator"), perform relatively well due to their lower sample complexity, which aligns with previous observations (Shankar et al., 2020; Arora et al., 2020) made in different contexts. Next, we compare the images generated by the multi-resolutional GA-CNTK and GA-CNTKg (see Section 3.2) on the CelebA-HQ dataset. The multi-resolutional GA-CNTK employs 3 discriminators working at 256×256, 64×64, and 16×16 pixel resolutions, respectively. Figure 2 shows the results. We can see that the multi-resolutional GA-CNTK (without a generator) gives better-looking images than GA-CNTKg (with a generator) because learning a generator, which maps between two spaces, is essentially a harder problem than finding a set of plausible points directly. Although a generator synthesizes data faster at inference time, it may not be necessary for generating high-quality images under the adversarial setting.

4.2. TRAINING STABILITY

Convergence. Figure 3 shows the learning curve and the relationship between the image quality and the number of gradient descent iterations during a training process of GA-CNTK. We find that GA-CNTK easily converges under various conditions, which is supported by Theorem 3.1. Furthermore, we can see a correlation between the image quality and the loss value: as the loss becomes smaller, the quality of the synthesized images improves. This correlation can save the human labor of monitoring the training process, which is common when training GANs. Note that the images generated in the latter stage of training contain recognizable patterns that change over training time. This is a major source of GA-CNTK's creativity. Please see Section 9.4 for more discussion.

Mode collapse. To study how different methods align P_gen with P_data, we train them using a 2D toy training set where P_data is a 25-modal Gaussian mixture. We use two 3-layer fully-connected neural networks as the generator and discriminator for each baseline, and an ensemble of 3-layer, infinitely wide counterparts as the discriminator in GA-FNTK. For GANs, we stop the alternating SGD training after the generator receives 1000 updates; for GA-FNTK, we terminate the gradient descent training after 1000 iterations. Figure 4 shows the resultant P_gen of each method. GA-FNTK avoids mode collapse because it does not rely on alternating SGD.

Gradient vanishing. To verify that GA-NTK gives no vanishing gradients with a finite λ, we conduct an experiment using another toy dataset consisting of 256 MNIST images and 256 random noises. We replace the discriminator of GA-CNTK with a single parametric network of the same architecture but finite width. We train the finite-width network on the toy dataset by minimizing the MSE loss using gradient descent.
We set the number of training iterations to a large value (65536) to simulate the situation where the network value becomes saturated on both sides of the decision boundary. Figure 5 compares the gradients of a generated image Z^n_{i,:} in Eq. (4) obtained from 1) the finite-width network and 2) the corresponding GA-CNTK with a large t. As Z^n_{i,:} evolves through gradient descent iterations, the norm of its gradients obtained from the finite-width discriminator quickly shrinks to zero. In contrast, the gradient norm obtained from the discriminator of GA-CNTK remains positive thanks to the infinite ensembling.

Table 2 summarizes the FID and AM-SSIM scores of the generated images. On MNIST, WGAN-GP slightly outperforms GA-CNTKg. The training of WGAN-GP on MNIST is easy, so GA-CNTKg does not offer much advantage. However, on a more complex task like CIFAR-10 or CelebA, GA-CNTKg outperforms WGAN-GP, suggesting that our single-level modeling is indeed beneficial.

4.3. SCALABILITY

We have conducted more experiments. Please see Appendix for their results.

5. CONCLUSION

We proposed GA-NTK and showed that adversarial data synthesis can be done via single-level modeling. GA-NTK can be solved by ordinary gradient descent, avoiding the difficulties of the bilevel training of GANs. We analyzed the convergence behavior of GA-NTK and gave sufficient conditions for convergence. Extensive experiments were conducted to study the advantages and limitations of GA-NTK. We proposed the batch-wise and multi-resolutional variants to improve memory efficiency and image quality, respectively, and showed that GA-NTK works either with or without a generator network. GA-NTK works well with small data, making it suitable for applications where data are hard to collect. GA-NTK also opens up opportunities for adapting various GAN enhancements to the kernel regime. These are directions for our future work.

6. STATISTICAL INTERPRETATION OF GA-NTK

Statistically, minimizing Eq. (4) or (6) amounts to minimizing the Pearson χ²-divergence (Jeffreys, 1946), a case of f-divergence (Rényi, 1961), between P_data + P_gen and 2P_gen, where P_data is the distribution of real data and P_gen is the distribution of generated points. To see this, we first rewrite the loss of our discriminator D, denoted by L(D), in expectation:

$$\arg\min_D L(D) = \arg\min_D\; \mathbb{E}_{x\sim P_{\mathrm{data}}}[(D(x) - 1)^2] + \mathbb{E}_{x\sim P_{\mathrm{gen}}}[(D(x) - 0)^2]. \tag{8}$$

Here, P_gen can represent either the distribution of Z^n in Eq. (4) or the output of the generator G in Eq. (6). Similarly, the loss function for our P_gen, denoted by L(P_gen; D), can be written as follows:

$$\arg\min_{P_{\mathrm{gen}}} L(P_{\mathrm{gen}}; D) = \arg\min_{P_{\mathrm{gen}}}\; \mathbb{E}_{x\sim P_{\mathrm{data}}}[(D(x) - 1)^2] + \mathbb{E}_{x\sim P_{\mathrm{gen}}}[(D(x) - 1)^2]. \tag{9}$$

GA-NTK, in the form of Eqs. (8) and (9), is a special case of LSGAN (Mao et al., 2017). Let D* be the minimizer of Eq. (8). We can see that Eqs. (4) and (6) effectively solve the problem of minimizing the Pearson χ²-divergence between P_data + P_gen and 2P_gen. The loss becomes zero when P_data(x) = P_gen(x) for all x. Therefore, minimizing Eq. (4) or (6) brings P_gen closer to P_data.

7. PROOF OF THEOREM 3.1

In this section, we prove the convergence of a GA-NTK whose discriminator D approximates an infinite ensemble of infinitely wide, fully-connected, feedforward neural networks. The proof can be easily extended to other network architectures, such as convolutional neural networks.

7.1. BACKGROUND AND NOTATION

Consider a fully-connected, feedforward neural network f: R^d → R,

$$f(x;\theta) = \frac{\sigma_w}{\sqrt{d_{L-1}}} W^{L}\, \phi\!\left(\frac{\sigma_w}{\sqrt{d_{L-2}}} W^{L-1}\, \phi\!\left(\cdots\, \phi\!\left(\frac{\sigma_w}{\sqrt{d}} W^{1} x + \sigma_b b^{1}\right)\cdots\right) + \sigma_b b^{L-1}\right) + \sigma_b b^{L}, \tag{11}$$

where φ(·) is the activation function (applied element-wise), L is the number of hidden layers, {d_1, ..., d_{L-1}} are the dimensions (widths) of the hidden layers, θ = ∪_{l=1}^{L} θ^l = ∪_{l=1}^{L} (W^l ∈ R^{d_l×d_{l-1}}, b^l ∈ R^{d_l}) are trainable weights and biases whose initial values are i.i.d. Gaussian random variables N(0, 1), and σ_w² and σ_b² are scaling factors that control the variances of the weights and biases, respectively. Suppose f is trained on a labeled dataset D^{2n} = (X^n ⊕ Z^n ∈ R^{2n×d}, 1^n ⊕ 0^n ∈ R^{2n}) by minimizing the MSE loss using t gradient descent iterations with learning rate η. Let θ^{(0)} and θ^{(t)} be the initial and trained parameters, respectively. As d_1, ..., d_{L-1} → ∞, we can approximate the distribution of f(x; θ^{(t)}) as a Gaussian process (NTK-GP) (Jacot et al., 2018; Lee et al., 2019; Chizat et al., 2019) whose behavior is controlled by the kernel matrix

$$K^{2n,2n} = \nabla_\theta f(X^n \oplus Z^n; \theta^{(0)})\, \nabla_\theta f(X^n \oplus Z^n; \theta^{(0)})^\top \in \mathbb{R}^{2n\times 2n}, \tag{12}$$

where f(X^n ⊕ Z^n; θ^{(0)}) ∈ R^{2n} is the vector of in-sample predictions made by the initial f. The value of each element K^{2n,2n}_{i,j} = k^L((X^n ⊕ Z^n)_{i,:}, (X^n ⊕ Z^n)_{j,:}) represents the similarity score of two rows (points) of X^n ⊕ Z^n in a kernel space, and it can be expressed by a kernel function k^L: R^d × R^d → R, called the neural tangent kernel (NTK). The NTK is deterministic, as it depends only on φ(·), σ_w, σ_b, and L rather than the specific values in θ^{(0)}. Furthermore, it can be evaluated layer-wise. Let h^l_j(x) be the pre-activation of the j-th neuron at the l-th layer of f(x; θ^{(t)}). The distribution of h^l_j(x) is still an NTK-GP, and its associated NTK is defined as k^l: R^d × R^d → R,

$$k^l(x, x') = \nabla_{\theta^{\le l}} h^l_j(x)\, \nabla_{\theta^{\le l}} h^l_j(x')^\top, \quad \text{where } \theta^{\le l} = \cup_{i=1}^{l} \theta^i.$$
Note that all $h^l_j(x)$'s, $\forall j$, are i.i.d. and thus share the same kernel. It can be shown that

$$ k^l(x, x') = \nabla_{\theta^l} h^l_j(x)^\top \nabla_{\theta^l} h^l_j(x') + \nabla_{\theta^{\le l-1}} h^l_j(x)^\top \nabla_{\theta^{\le l-1}} h^l_j(x') = \hat{k}^l(x, x') + \sigma_w^2\, k^{l-1}(x, x')\, \mathbb{E}_{(h^{l-1}_j(x),\, h^{l-1}_j(x')) \sim \mathcal{N}(\mathbf{0}_2,\, \hat{K}^{l-1})} \left[ \phi'(h^{l-1}_j(x))\, \phi'(h^{l-1}_j(x')) \right] \tag{13} $$

and

$$ k^1(x, x') = \frac{\sigma_w^2}{d} x^\top x' + \sigma_b^2, \tag{14} $$

where $\hat{k}^l: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is the NNGP kernel (Lee et al., 2018; Matthews et al., 2018) that controls the behavior of another Gaussian process, called the NNGP, approximating the distribution of $f(x; \theta^{(0)})$, and

$$ \hat{K}^{l-1} = \begin{bmatrix} \hat{k}^{l-1}(x, x) & \hat{k}^{l-1}(x, x') \\ \hat{k}^{l-1}(x, x') & \hat{k}^{l-1}(x', x') \end{bmatrix} \in \mathbb{R}^{2 \times 2}. $$
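For ReLU activations, both the NNGP kernel $\hat{k}^l$ and the derivative expectation in Eq. (13) have closed forms (the arc-cosine kernels of Cho & Saul), so the recursion of Eqs. (13)-(14) can be evaluated directly. A minimal sketch under that ReLU assumption; the function name and default scaling factors are ours.

```python
import numpy as np

def relu_ntk(x, xp, L, sw2=2.0, sb2=0.1):
    """NTK k^L(x, x') of an L-layer ReLU network via Eqs. (13)-(14)."""
    d = len(x)
    # Layer 1: k^1(x, x') = (sigma_w^2 / d) x^T x' + sigma_b^2   -- Eq. (14)
    sxx = sw2 / d * x @ x + sb2     # NNGP variance at x
    syy = sw2 / d * xp @ xp + sb2   # NNGP variance at x'
    sxy = sw2 / d * x @ xp + sb2    # NNGP covariance
    ntk = sxy                       # k^1 coincides with the NNGP kernel
    for _ in range(2, L + 1):
        c = np.clip(sxy / np.sqrt(sxx * syy), -1.0, 1.0)
        theta = np.arccos(c)
        # Closed-form Gaussian expectations for ReLU:
        #   E[phi(u) phi(v)]   -> next NNGP covariance (hat k^l)
        #   E[phi'(u) phi'(v)] = (pi - theta) / (2 pi)
        new_sxy = sw2 / (2 * np.pi) * np.sqrt(sxx * syy) \
                  * (np.sin(theta) + (np.pi - theta) * c) + sb2
        dot = sw2 * (np.pi - theta) / (2 * np.pi)
        ntk = new_sxy + dot * ntk   # Eq. (13)
        sxx = sw2 * sxx / 2 + sb2   # diagonal terms: theta = 0
        syy = sw2 * syy / 2 + sb2
        sxy = new_sxy
    return ntk
```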

7.2. CONVERGENCE

The GA-NTK employs the above NTK-GP as the discriminator D. The in-sample mean predictions of D can thus be written in closed form:

$$ D(X^n, Z^n) = (I_{2n} - e^{-\eta t K^{2n,2n}})\, y^{2n} \in \mathbb{R}^{2n}, \tag{15} $$

where $I_{2n}$ is the identity matrix and $y^{2n} = \mathbf{1}_n \oplus \mathbf{0}_n \in \mathbb{R}^{2n}$ is the "correct" label vector for training D. We formulate the objective of GA-NTK as

$$ \arg\min_{Z^n} \mathcal{L}(Z^n) = \arg\min_{Z^n} \tfrac{1}{2} \left\| \mathbf{1}_{2n} - D(X^n, Z^n) \right\|^2, \tag{16} $$

where $\mathbf{1}_{2n} \in \mathbb{R}^{2n}$ in the loss $\mathcal{L}(\cdot)$ is the "wrong" label vector that guides the search for the points ($Z^n$) that best deceive the discriminator. We show the following:

Theorem 7.1. Let $s$ be the number of gradient descent iterations used to solve Eq. (16), and let $Z^{n,(s)}$ be the solution at the $s$-th iteration. Suppose the following values are bounded: (a) $X^n_{i,j}$ and $Z^{n,(0)}_{i,j}$, $\forall i, j$; (b) $t$ and $\eta$; and (c) $\sigma_w$, $\sigma_b$, and $L$. Also, assume that (d) $X^n$ contains finite, non-identical, normalized rows. Then, for a sufficiently large $t$, we have $\min_{j \le s} \| \nabla_{Z^n} \mathcal{L}(Z^{n,(j)}) \|^2 \le O(\frac{1}{s-1})$.
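Eqs. (15)-(16) are directly computable from any kernel Gram matrix. A minimal sketch, assuming a symmetric positive semi-definite $K$ so that the matrix exponential can be evaluated through an eigendecomposition; function names are ours.

```python
import numpy as np

def discriminator_mean(K, y, eta, t):
    # Eq. (15): D = (I - exp(-eta * t * K)) y, via eigendecomposition of K
    lam, V = np.linalg.eigh(K)
    expm = (V * np.exp(-eta * t * lam)) @ V.T
    return (np.eye(len(y)) - expm) @ y

def ga_ntk_loss(K, y, eta=1.0, t=50.0):
    # Eq. (16): L(Z) = 1/2 || 1 - D ||^2 with the all-ones "wrong" labels
    D = discriminator_mean(K, y, eta, t)
    return 0.5 * np.sum((np.ones_like(y) - D) ** 2)
```

With a positive-definite $K$ and a large $\eta t$, $D$ approaches $y$, so the loss approaches $n/2$ (each of the $n$ fake labels contributes $\tfrac{1}{2}$), the value induced by a perfectly trained discriminator.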

7.3. PROOF

To prove Theorem 7.1, we first introduce the notion of $\beta$-smoothness:

Definition 7.1. A continuously differentiable function $g: \mathbb{R}^d \to \mathbb{R}$ is $\beta$-smooth if there exists $\beta \in \mathbb{R}$ such that $\| \nabla_a g(a) - \nabla_b g(b) \| \le \beta \| a - b \|$ for any $a, b \in \mathbb{R}^d$.

It can be shown that gradient descent finds a stationary point of a $\beta$-smooth function efficiently (Gower, 2022).

Lemma 7.1. Let $a^{(s)}$ be the input of a function $g: \mathbb{R}^d \to \mathbb{R}$ after applying $s$ gradient descent iterations to an initial input $a^{(0)}$. If $g$ is $\beta$-smooth, then $g(a^{(s)})$ converges to a stationary point at rate $\min_{j \le s} \| \nabla_a g(a^{(j)}) \|^2 \le O(\frac{1}{s-1})$.

So, our goal is to show that the loss $\mathcal{L}(Z^n)$ in Eq. (16) is $\beta$-smooth w.r.t. any generated point $z \in \mathbb{R}^d$.

Corollary 7.1. If all the conditions (a)-(d) in Theorem 7.1 hold, there exists a constant $c_1 \in \mathbb{R}^+$ such that $\| \nabla_z \mathcal{L}(Z^n) \| \le c_1$ for each row $z \in \mathbb{R}^d$ of $Z^n$. This makes $\mathcal{L}(Z^n)$ $\beta$-smooth.

To prove Corollary 7.1, consider $D_i(X^n, Z^n)$ and $\nabla_{z_j} \mathcal{L}(Z^n)$, the $i$-th and $j$-th elements of $D(X^n, Z^n) \in \mathbb{R}^{2n}$ and $\nabla_z \mathcal{L}(Z^n) \in \mathbb{R}^d$, respectively. We have

$$ \nabla_{z_j} \mathcal{L}(Z^n) = \nabla_{z_j} \tfrac{1}{2} \| \mathbf{1}_{2n} - D(X^n, Z^n) \|^2 = \sum_{i=1}^{2n} (D_i(X^n, Z^n) - 1) \cdot \nabla_{z_j} D_i(X^n, Z^n). \tag{17} $$

Given a sufficiently large $t$, $D_i(X^n, Z^n)$ can be arbitrarily close to $y_i \in \{0, 1\}$ because $K^{2n,2n}$ is positive definite (Jacot et al., 2018) and therefore $(I_{2n} - e^{-\eta t K^{2n,2n}}) \to I_{2n}$ as $t \to \infty$ in Eq. (15). Hence, there exists $\epsilon \in \mathbb{R}^+$ such that

$$ | \nabla_{z_j} \mathcal{L}(Z^n) | \le \epsilon \sum_{i=1}^{n} | \nabla_{z_j} D_i(X^n, Z^n) | + (1 + \epsilon) \sum_{i=n+1}^{2n} | \nabla_{z_j} D_i(X^n, Z^n) | \le (1 + \epsilon) \sum_{i=1}^{2n} | \nabla_{z_j} D_i(X^n, Z^n) | = (1 + \epsilon) \sum_{i=1}^{2n} \Big| \nabla_{z_j} \sum_{p=1}^{2n} \big( I^{2n}_{i,p} - e^{-\eta t K^{2n,2n}}_{i,p} \big) y^{2n}_p \Big| = (1 + \epsilon)\, \eta t \sum_{i,p,q=1}^{2n} e^{-\eta t K^{2n,2n}}_{i,q} \big| \nabla_{z_j} k^L((X^n \oplus Z^n)_{q,:}, (X^n \oplus Z^n)_{p,:})\, y^{2n}_p \big|. $$

Note that $e^{-\eta t K^{2n,2n}}_{i,q} \in \mathbb{R}^+$ can be made arbitrarily close to $0$ with a sufficiently large $t$. Hence, Corollary 7.1 holds as long as $\nabla_{z_j} k^L((X^n \oplus Z^n)_{q,:}, (X^n \oplus Z^n)_{p,:})$ is bounded.
Corollary 7.2. If the conditions (a)-(d) in Theorem 7.1 hold, there exists a constant $c_2 \in \mathbb{R}^+$ such that $| \nabla_{z_j} k^L(a, b) | \le c_2$ for any two rows $a$ and $b$ of $X^n \oplus Z^n$.

It is clear that $\nabla_{z_j} k^L(a, b) = 0$ if neither $a$ nor $b$ equals $z$. So, without loss of generality, we consider $\nabla_{z_j} k^L(a, z)$ only. From Eq. (13), we have

$$ \frac{\partial k^L(a, z)}{\partial z_j} = \frac{\partial k^L(a, z)}{\partial k^{L-1}(a, z)} \cdot \frac{\partial k^{L-1}(a, z)}{\partial k^{L-2}(a, z)} \cdots \frac{\partial k^1(a, z)}{\partial z_j}. $$

For each $l = 2, \cdots, L$, we can bound $\partial k^l(a, z) / \partial k^{l-1}(a, z)$ by

$$ \frac{\partial k^l(a, z)}{\partial k^{l-1}(a, z)} = \sigma_w^2\, \mathbb{E}_{(h^{l-1}_j(x),\, h^{l-1}_j(x')) \sim \mathcal{N}(\mathbf{0}_2,\, \hat{K}^{l-1})} \left[ \phi'(h^{l-1}_j(x))\, \phi'(h^{l-1}_j(x')) \right] \le \left( \sigma_w \max_h \phi'(h) \right)^2, $$

provided that the maximum slope of $\phi$ is bounded, which is true for many popular activation functions, including ReLU and erf. Also, by Eq. (14), the value $\partial k^1(a, z) / \partial z_j = \frac{\sigma_w^2}{d} a_j$ is bounded. Therefore, Corollary 7.2 holds, which in turn makes $\mathcal{L}(Z^n)$ $\beta$-smooth via Corollary 7.1. By Lemma 7.1, we obtain Theorem 7.1.
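The convergence claim of Theorem 7.1 can also be illustrated numerically. The sketch below substitutes an RBF kernel for the true NTK (an assumption, purely for illustration) and uses central finite differences for $\nabla_{Z^n} \mathcal{L}$; it tracks $\| \nabla \mathcal{L} \|^2$ along the gradient-descent iterates.

```python
import numpy as np

def gram(P):
    # Stand-in for k^L: an RBF kernel (an assumption for illustration only)
    sq = np.sum(P ** 2, axis=1)
    return np.exp(-(sq[:, None] + sq[None, :] - 2 * P @ P.T))

def loss(X, Z, eta_t=3.0):
    # Eq. (16), with D from Eq. (15)
    P = np.vstack([X, Z])
    y = np.concatenate([np.ones(len(X)), np.zeros(len(Z))])
    lam, V = np.linalg.eigh(gram(P))
    D = (np.eye(len(y)) - (V * np.exp(-eta_t * lam)) @ V.T) @ y
    return 0.5 * np.sum((np.ones_like(y) - D) ** 2)

def grad_Z(X, Z, eps=1e-5):
    # Central finite differences w.r.t. the generated points Z
    g = np.zeros_like(Z)
    for i in range(Z.shape[0]):
        for j in range(Z.shape[1]):
            Zp, Zm = Z.copy(), Z.copy()
            Zp[i, j] += eps
            Zm[i, j] -= eps
            g[i, j] = (loss(X, Zp) - loss(X, Zm)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 2))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalized rows, condition (d)
Z = 0.1 * rng.standard_normal((4, 2))
norms = []
for _ in range(20):
    g = grad_Z(X, Z)
    norms.append(float(np.sum(g ** 2)))
    Z -= 0.5 * g
# min over the iterates of ||grad L||^2 shrinks, as Theorem 7.1 predicts
```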

8. EXPERIMENT SETTINGS

This section provides more details about the settings of our experiments.

8.1. MODEL SETTINGS

The network architectures of the baseline GANs used in our experiments are based on InfoGAN (Chen et al., 2016). We set the latent dimensions, training iterations, and batch size following the study of Lucic et al. (2018). The latent dimension of the generator is 64 in all cases, and the batch size for all baselines is set to 64. The numbers of training iterations are 80K, 100K, and 400K for the MNIST, CelebA, and CIFAR-10 datasets, respectively. For the optimizers, we follow the settings in the respective original papers. Below we list the network architecture of the baselines for each dataset as well as the optimizer settings. Note that we remove all the batchnorm layers from the discriminators in WGAN-GP.

We architect the element network of the discriminator in our GA-NTK following InfoGAN (Chen et al., 2016), except that the width (or the number of filters) of the network is infinite at each layer and there are no batchnorm layers. The generator of GA-CNTKg consumes a large amount of memory. To reduce memory consumption, we let D discriminate between true and fake images in the code space of a pre-trained autoencoder A (Bergmann et al., 2019). After training, a code output by G is fed into the decoder of A to obtain an image. The architectures of the pre-trained A for different datasets are summarized as follows:

Algorithm 1: Unidirectional search for the hyperparameter λ of GA-NTK.
  Input: data X^n, kernel k, and separation tolerance ε
  Output: λ for GA-NTK
  Randomly initialize Z^n ∈ R^{n×d}
  λ ← 1
  while (1/2n) ‖D(X^n, Z^n; k, λ) − (1_n ⊕ 0_n)‖² > ε do
    λ ← λ · 2
  end while
  return λ
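Algorithm 1 can be sketched as follows. We assume the loop keeps doubling λ while the separation error (1/2n)‖D − (1_n ⊕ 0_n)‖² still exceeds ε, matching the "from small to large" description of the search; the helper names are ours, and the discriminator predictions follow Eq. (15).

```python
import numpy as np

def separation_error(K, y, lam):
    # (1/2n) || D(X, Z; k, lam) - (1_n ⊕ 0_n) ||^2 with D = (I - exp(-lam K)) y
    w, V = np.linalg.eigh(K)
    D = (np.eye(len(y)) - (V * np.exp(-lam * w)) @ V.T) @ y
    return np.sum((D - y) ** 2) / len(y)

def search_lambda(K, y, eps=1e-3, max_doublings=60):
    # Unidirectional search: double lambda until the discriminator nearly
    # separates real data (label 1) from pure noise (label 0).
    lam = 1.0
    for _ in range(max_doublings):
        if separation_error(K, y, lam) <= eps:
            break
        lam *= 2.0
    return lam
```

Because λ = ηt only enters through the exponential in Eq. (15), doubling converges to an adequate value in a handful of iterations for any positive-definite kernel matrix.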

8.3. HYPERPARAMETER TUNING

For each data synthesis method, we tune its hyperparameters using grid search.

GA-NTK. The computation of K^{2n,2n} requires one to determine the initialization and architecture of the element networks in the ensemble discriminator. Poole et al. (2016); Schoenholz et al. (2017); Raghu et al. (2017) have proposed principled methods to tune the hyperparameters of the initialization. From our empirical results, we also find that the quality of the images generated by GA-NTK is not significantly impacted by the choice of architecture: a fully connected network with rectified linear unit (ReLU) activations suffices to generate recognizable image patterns. Once K^{2n,2n} is decided, there is only one hyperparameter λ = ηt to tune in Eq. (16). The λ controls how well the discriminator is trained on D_{2n}, so a value that is either too small or too large can lead to poor gradients for Z^n and the final generated points. But since there are no alternating updates as in GANs, we can decide an appropriate value of λ without worrying about canceling the learning progress of Z^n. We propose a simple, unidirectional search algorithm for tuning λ, as shown in Algorithm 1. Basically, we search, from small to large, for a value that makes the discriminator nearly separate the real data from pure noise in an auxiliary learning task, and then use this value to solve Eq. (16). In practice, a small positive ε ranging from 10^-3 to 10^-2 suffices to give an appropriate λ.

Multi-resolutional GA-NTK. We use 3 NTK-GPs as the discriminators, whose architectures are listed in Table 8.

Next, we compare the images generated by GA-FNTK, GA-CNTK, and the multi-resolutional GA-CNTK described in Section 3.2 on the CelebA and CelebA-HQ datasets. The multi-resolutional GA-CNTK employs 3 discriminators working at 256×256, 64×64, and 16×16 pixel resolutions, respectively. Figure 6 shows the results.
To our surprise, GA-NTK (which models the discriminator as an ensemble of fully connected networks) suffices to generate recognizable faces. The images synthesized by GA-FNTK and GA-CNTK lack details and global coherence, respectively. The results also demonstrate the potential of GA-NTK variants to generate high-quality data, as there are many other techniques for GANs that could be adapted to GA-NTK.

9.2. BATCH-WISE GA-NTK

To work with a larger training set, we modify GA-CNTK following the instructions in Section 3.2 to obtain the batch-wise GA-CNTK, which computes the gradients of Z^n in Eq. (4) from 256 randomly sampled training images during each gradient descent iteration. We train the batch-wise GA-CNTK on two larger datasets consisting of 2048 images from CelebA and 1300 images from ImageNet, respectively. Figure 8 shows the results; the batch-wise GA-CNTK can successfully generate the "daisy" images on ImageNet. Note that the batch-wise GA-CNTK solves a different problem than the original GA-CNTK: the former finds a Z^n that deceives multiple discriminators, each trained on a batch of 256 examples, while the latter searches for a Z^n that fools a single discriminator trained on 256 examples. We found that, when the batch size is small (b = 1), GA-NTK tends to generate a blurry mean image regardless of the model architecture and the initializations of the model weights and Z^n, as shown in Figure 7. This is because the mean image is the best for simultaneously fooling many NTK discriminators, each trained on a single example. However, this setting is less common in practice, as one usually aims to use the largest b possible (Brock et al., 2018). Figure 8 shows that a batch size of 256 suffices to give plausible results on the CelebA and ImageNet datasets. Comparing the images in Figure 1(f) with those in Figure 8(a), we can see that the batch-wise GA-CNTK gives slightly blurrier images, but the patterns in each synthesized image are more globally coherent; both effects are due to the multiple discriminators.
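The batch-wise scheme boils down to resampling the discriminator's training batch at every step. A minimal sketch of the control flow, with the per-batch gradient supplied as a callback (`grad_fn`, a placeholder for the gradient of the loss in Eq. (4)); all names are ours.

```python
import numpy as np

def batchwise_updates(X, Z, grad_fn, lr=0.5, batch=256, steps=100, seed=0):
    # Each step: sample a fresh batch of real images, build the discriminator
    # on (batch, Z), and take one gradient step on Z against it. Over training,
    # Z must therefore fool many discriminators, one per sampled batch.
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        idx = rng.choice(len(X), size=min(batch, len(X)), replace=False)
        Z = Z - lr * grad_fn(X[idx], Z)   # grad_fn: gradient of Eq. (4)'s loss
    return Z
```

For instance, with a degenerate `grad_fn` that merely pulls Z toward the batch mean, the iterates collapse to a mean point, mirroring the blurry mean image observed at b = 1.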

9.3. SENSITIVITY TO HYPERPARAMETERS

Here, we study how sensitive the performance of WGAN, WGAN-GP, and GA-FNTK is to their hyperparameters. We adjust the hyperparameters of each approach using grid search under a time budget of 3 hours, and then evaluate the quality of 2048 generated data points by the Wasserstein distance between P_gen and P_data. We train the different methods on two toy datasets consisting of 8- and 25-modal Gaussian mixtures, following the settings described in Section 4.2. Figure 9 shows the results; GA-FNTK achieves the lowest average Wasserstein distance in both cases. Moreover, its variances are smaller than those of the two other baselines. This shows that the performance of GA-FNTK is less sensitive to the hyperparameters, so it could be easier to tune in practice. Note that, within the 3-hour time budget, the hyperparameters obtained through grid search are good enough to reproduce the mode-collapse experiments conducted by Mao et al. (2017). In these experiments, the P_gen of each method aims to align with a 2D 8-modal Gaussian mixture in the ground truth. Our results are shown in Figure 10.

9.4. EVOLUTION OF IMAGES DURING TRAINING

Figure 11 shows the learning curve of the generator G in GA-CNTKg and the relationship between the quality of the images output by G and the number of gradient descent iterations. The results show that the loss can be minimized even if it is an f-divergence, and a lower loss score implies higher image quality. This is consistent with the results of GA-CNTK (without a generator) shown in Figure 3.

Source of creativity. The diversity of our generated data comes not only from the randomness of the optimization algorithm (e.g., the initialization of Z or the splitting of X into batches, as discussed in Section 3.2) but also from the objective in Eq. (4) itself. To see this, observe in Figure 3 that the images generated at the later stage of training contain recognizable patterns that change constantly over training time, despite little change in the loss score. The reason is that, in Eq. (4), Z^n is optimized for a moving target: any change of Z^n causes D to be "retrained" instantly. The training of the generator G in Eq. (6) also shares this nice property. In Figure 11, the patterns of a generated image G(z) change over training time even when the input z is fixed. However, obtaining diverse artificial data through this property requires prolonged training time. In practice, we can simply initialize Z differently to achieve diversity faster.

10. MORE IMAGES GENERATED BY GA-CNTK AND GA-CNTKG

Figures 12-16 show more sample images synthesized by GA-CNTK and GA-CNTKg. All these images are obtained using the settings described in the main paper and above. We can see that the quality of the images synthesized by GA-CNTKg is worse than that of the images synthesized by GA-CNTK, as discussed in Section 4.1. Furthermore, recall from Table 1 that, without a generator network, GA-NTK performs better as the data size increases. However, this is not the case for GA-NTKg, which has a generator network.
We have resampled the training data and rerun the experiments 5 times with different initial values of Z^n but obtained similar results. Therefore, we believe the instability is due to the sample complexity of the generator network: 256 examples or less are insufficient to train a stable, high-quality generator. This is evident in Figures 12(b)-15(b), where the generator outputs unrecognizable images more often.

As discussed in the main paper, we find that, when the size of the training set is small, an image synthesis method may produce images that look almost identical to some images in the training set. This problem is less studied in the literature but important to applications with limited training data. We investigate it by showing the images from the training set that are nearest to a generated image, using SSIM (Wang et al., 2004) as the distance measure. Figures 17, 18, and 19 show the results for some randomly sampled synthesized images. Compared to GANs, both GA-CNTK and the batch-wise GA-CNTK generate images that look less similar to the ground-truth images.



Our code is available on GitHub at https://github.com/ga-ntk/ga-ntk.

Mode collapse can be caused by other reasons, such as the structure of G. This paper only addresses the problem caused by alternating SGD.

From a GAN perspective, our work can be regarded as a special case of the framework proposed by Franceschi et al. (2021), where the discriminator neglects the effect of historical generator updates and only distinguishes between the true and currently generated data at each alternating step. In GANs, solving Z directly against a finite-width discriminator is infeasible because it amounts to finding adversarial examples (Goodfellow et al., 2015), whose gradients are known to be very noisy (Ilyas et al., 2019).



Figure 1: The images generated by different methods on MNIST, CIFAR-10, and CelebA datasets given only 256 training images.

Figure 2: The images generated by GA-CNTK (a) without and (b) with a generator given 256 CelebA-HQ training images.

Figure 3: The learning curve and image quality at different stages of a training process.

Figure 4: Visualization of distribution alignment and mode collapse on a 2D toy dataset.

Figure 5: Comparison between the gradients of a Z_{i,:} in Eq. (4) obtained from different types of D.

$$ \arg\min_{P_{gen}} \mathcal{L}(P_{gen}; D^*) = \arg\min_{P_{gen}} \mathbb{E}_{x \sim P_{data}} \left[ (D^*(x) - 1)^2 \right] + \mathbb{E}_{x \sim P_{gen}} \left[ (D^*(x) - 1)^2 \right]. \tag{10} $$

Mao et al. (2017) show that, under mild relaxation, minimizing Eq. (10) amounts to minimizing the Pearson $\chi^2$-divergence between $P_{data} + P_{gen}$ and $2P_{gen}$:

$$ \arg\min_{P_{gen}} \mathcal{L}(P_{gen}; D^*) = \arg\min_{P_{gen}} \chi^2_{\text{Pearson}}(P_{data} + P_{gen} \,\|\, 2P_{gen}) = \arg\min_{P_{gen}} \int (P_{data}(x) + P_{gen}(x)) \left( \frac{2P_{gen}(x)}{P_{data}(x) + P_{gen}(x)} - 1 \right)^2 dx. $$

Figure 6: The images generated by (a) GA-FNTK and (b) GA-CNTK given 256 CelebA training images.

Figure 7: When b = 1, GA-NTK tends to generate a blurry mean image.

Figure 8: The images generated by batch-wise GA-CNTK on (a) CelebA dataset of 2048 randomly sampled images and (b) ImageNet dataset of 1300 randomly sampled images.

Figure 11: The learning curve of G in GA-CNTKg and the generated images G(z) at different stages of training given the same input z.


Figure 12: Sample images generated by GA-CNTK (a) without and (b) with generator on the MNIST dataset of 256 randomly sampled images.

Figure 15: Sample images generated by multi-resolutional GA-CNTK on the CelebA-HQ dataset of 256 randomly sampled images.

Figure 16: Sample images generated by GA-CNTKg on the CelebA-HQ dataset of 256 randomly sampled images.

Figure 17: Comparison between the images generated by WGAN-GP trained on 256 images and the nearest neighbors (measured by SSIM) from the training set. Images with red bounding boxes are generated images.

Figure 18: Comparison between the images generated by GA-CNTK trained on 256 images and the nearest neighbors (measured by SSIM) from the training set. Images with red bounding boxes are generated images.

Figure 19: Comparison between the images generated by GA-CNTKg trained on 256 images and the nearest neighbors (measured by SSIM) from the training set. Images with red bounding boxes are generated images.

K^{2n,2n}, which has O(n²) space complexity. This may incur scalability issues for large datasets. Although recent efforts by Arora et al. (2019); Bietti & Mairal (2019); Han et al. (2021); Zandieh et al. (

Table 1 summarizes the FID and AM-SSIM scores of the generated images. LSGAN and DCGAN, which use an f-divergence as the loss function, give high FID scores and fail to generate recognizable images on the CIFAR-10 and CelebA datasets due to the various training issues mentioned previously. StyleGAN, although being able

The FID and AM-SSIM scores of the images generated by different methods.

The FID and AM-SSIM scores of the images output by WGAN-GP and GA-CNTKg trained on 2048 CelebA images with batch size 256.

The architectures of the discriminator and generator in the baseline GANs for the MNIST dataset.

The architectures of the discriminator and generator in the baseline GANs for the CIFAR-10 dataset.

The architectures of the discriminator and generator in the baseline GANs for the CelebA dataset.

The optimizer settings for each GAN baseline. n_dis denotes the number of training steps for the discriminators in the alternating training process.

The architectures of the discriminators for multi-resolution GA-NTK.


12. SEMANTICS LEARNED BY GA-CNTKG

Here, we investigate whether the features learned by GA-NTK can encode high-level semantics. We plot "interpolated" images output by the generator G of GA-CNTKg, taking equidistantly spaced z's along a segment in the z space as input. For ease of presentation, we consider a 2-dimensional z space and train G on the MNIST and CelebA datasets of 256 examples. Figure 20 shows the results, where the generated patterns transition smoothly across the 2D z space and neighboring images share similar looks. These similar-looking images are generated from adjacent but meaningless z's, suggesting that the learned features encode high-level semantics.
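The interpolation grid described above can be produced with a few lines (a sketch; the helper name is ours):

```python
import numpy as np

def interpolate_codes(z0, z1, num=8):
    # Equidistantly spaced latent codes along the segment from z0 to z1;
    # feeding these to the generator G visualizes the learned semantics.
    ts = np.linspace(0.0, 1.0, num)[:, None]
    return (1.0 - ts) * np.asarray(z0, float) + ts * np.asarray(z1, float)
```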

13. CONVERGENCE SPEED AND TRAINING TIME

In this section, we study the time needed to train the GA-NTK variants and compare it with the training of GANs. We conduct experiments to investigate the number of iterations and the wall-clock time required to train different methods on datasets of 256 randomly sampled images. We use the batch-wise GA-CNTK and GA-CNTKg and set the batch size b to 64 for all methods. We run the experiments on a machine with a single NVIDIA Tesla V100 GPU. For DCGAN and LSGAN, whose loss scores do not reflect image quality, we monitor the training process manually and stop it as soon as the generated images contain recognizable patterns; however, these methods do not seem to converge. For the other methods, we use early stopping with a patience of 10,000 steps and a delta of 0.05 to determine convergence. The results are shown in Table 9. As we can see, the number of iterations required by either the batch-wise GA-CNTK or GA-CNTKg is significantly smaller than that of the GANs. This justifies our claims in Section 1. However, the batch-wise GA-CNTK and GA-CNTKg run fewer iterations per second than the GANs because of the higher computation cost of back-propagating through K^{b,b}. In terms of wall-clock time, the batch-wise GA-CNTK is the fastest, while GA-CNTKg runs about as fast as WGAN-GP. We expect that, with the continued optimization of the Neural Tangents library (Novak et al., 2019a), on which our code is based, the training speed of the GA-NTK variants can be further improved.
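For reproducibility, the early-stopping criterion described above (a patience of 10,000 steps and a delta of 0.05) can be sketched as follows; the class name is ours.

```python
class EarlyStopper:
    # Stops when the loss has not improved by at least `delta`
    # for `patience` consecutive steps.
    def __init__(self, patience=10000, delta=0.05):
        self.patience, self.delta = patience, delta
        self.best, self.wait = float("inf"), 0

    def should_stop(self, loss):
        if loss < self.best - self.delta:
            self.best, self.wait = loss, 0   # significant improvement: reset
        else:
            self.wait += 1
        return self.wait >= self.patience
```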

