PRECONDITION LAYER AND ITS USE FOR GANS

Abstract

One of the major challenges in training generative adversarial nets (GANs) is instability. Spectral normalization (SN) has been remarkably successful at addressing this instability. However, SN-GAN still suffers from training instabilities, especially when working with higher-dimensional data. We find that these instabilities are accompanied by large condition numbers of the discriminator weight matrices. To improve training stability, we draw on common linear-algebra practice and employ preconditioning. Specifically, we introduce a preconditioning layer (PC-layer) that performs low-degree polynomial preconditioning. We use this PC-layer in two ways: 1) fixed preconditioning (FPC) adds a fixed PC-layer to all layers; and 2) adaptive preconditioning (APC) adaptively controls the strength of preconditioning. Empirically, we show that FPC and APC stabilize the training of unconditional GANs using classical architectures. On LSUN 256 × 256 data, APC improves FID scores by around 5 points over baselines.

1. INTRODUCTION

Generative Adversarial Nets (GANs) (Goodfellow et al., 2014) successfully transform samples from one distribution to another. Nevertheless, training GANs is known to be challenging, and their performance is often sensitive to hyper-parameters and datasets. Understanding the training difficulties of GANs is thus an important problem. Recent studies in neural network theory (Pennington et al., 2017; Xiao et al., 2018; 2020) suggest that the spectrum of the input-output Jacobian or the neural tangent kernel (NTK) is an important metric for understanding training performance. While directly manipulating the spectrum of the Jacobian or the NTK is not easy, a practical alternative is to manipulate the spectrum of the weight matrices, e.g., via orthogonal initialization (Xiao et al., 2018). For a special class of neural nets, Hu et al. (2020) showed that orthogonal initialization leads to a better convergence result than Gaussian initialization, providing early theoretical evidence for the importance of manipulating the weight-matrix spectrum. Motivated by these studies, we suspect that an 'adequate' weight-matrix spectrum is also important for GAN training. Indeed, one of the most popular techniques for GAN training, spectral normalization (SN) (Miyato et al., 2018), manipulates the spectrum by scaling all singular values by a constant, which ensures that the spectral norm is upper bounded. However, we find that for some hyper-parameters and for high-resolution datasets, SN-GAN fails to generate good images. In these failure cases, we find that the condition numbers of the weight matrices become very large and that the majority of the singular values stay close to 0 during training; see Fig. 1(a) and Fig. 2(a). This can happen because SN does not promote a small condition number. This finding motivates us to reduce the condition number of the weights during GAN training. Recall that controlling the condition number is also a central problem in numerical linear algebra, known as preconditioning (see Chen (2005)).
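As a quick sanity check of this observation, the following numpy sketch (illustrative only, not part of our training pipeline) shows that an SN-style rescaling leaves the condition number of a weight matrix unchanged, so SN by itself cannot prevent ill-conditioning:

```python
import numpy as np

rng = np.random.default_rng(0)

def spectral_normalize(W):
    """Divide W by its largest singular value -- the effect of an SN layer."""
    return W / np.linalg.svd(W, compute_uv=False)[0]

def condition_number(W):
    s = np.linalg.svd(W, compute_uv=False)  # singular values, descending
    return s[0] / s[-1]

W = rng.normal(size=(64, 64))
W_sn = spectral_normalize(W)

# SN rescales every singular value by the same constant, so the
# condition number is unchanged: SN alone cannot keep it small.
assert np.isclose(condition_number(W), condition_number(W_sn))
```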
We hence seek to develop a "plug-in" preconditioner for the weights. This requires the preconditioner to be differentiable. Among the various preconditioners, we find the polynomial preconditioner to be a suitable choice due to its simple differentiation and strong theoretical support from approximation theory. Further, we propose adaptively adjusting the strength of the preconditioner during training so as not to overly restrict expressivity. We show the efficacy of preconditioning on CIFAR10 (32 × 32), STL (48 × 48), and LSUN bedroom, tower, and living room (256 × 256).
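To make the idea concrete, here is a minimal numpy sketch of one low-degree polynomial preconditioner: the classical Newton-Schulz cubic, a standard example from numerical linear algebra. It is differentiable (it is just matrix products and sums) and pushes all singular values toward 1; it is not necessarily the exact polynomial used in our PC-layer:

```python
import numpy as np

def pc_step(W):
    """One cubic polynomial step (Newton-Schulz): maps each singular
    value s to 1.5*s - 0.5*s**3, pushing the spectrum toward 1
    (valid once the spectral norm is at most 1)."""
    return 1.5 * W - 0.5 * W @ W.T @ W

def cond(W):
    s = np.linalg.svd(W, compute_uv=False)
    return s[0] / s[-1]

rng = np.random.default_rng(1)
W = rng.normal(size=(32, 32))
W /= np.linalg.svd(W, compute_uv=False)[0]   # spectral-normalize: sigma_max = 1

before = cond(W)
for _ in range(5):          # a few cheap, differentiable polynomial steps
    W = pc_step(W)

assert cond(W) < before     # the condition number shrinks at every step
```

Because each step is a polynomial in W, backpropagation through it is straightforward, which is what makes this family of preconditioners attractive as a "plug-in" layer.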

Summary of contributions.

For the deep linear network studied in (Hu et al., 2020), we prove that if all weight matrices have a bounded spectrum, then gradient descent converges to a global minimum at a geometric rate. We then introduce the PC-layer (preconditioning layer), which consists of a low-degree polynomial preconditioner. We further study adaptive preconditioning (APC), which adaptively controls the strength of preconditioning on different layers at different iterations. Applying FPC and APC to unconditional GAN training on LSUN data (256 × 256) permits generating high-quality images when SN-GAN fails. We also show that APC achieves better FID scores on CIFAR10, STL, and LSUN than the recently proposed method of Jiang et al. (2019).
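The adaptive idea can be sketched as follows. This is a hypothetical illustration only: the function name, threshold rule, and step count are assumptions for exposition, not the schedule used in our experiments:

```python
import numpy as np

def adaptive_precondition(W, threshold=10.0, max_steps=3):
    """Hypothetical APC-style rule: precondition a layer only when its
    condition number exceeds `threshold`, otherwise leave it untouched."""
    s = np.linalg.svd(W, compute_uv=False)
    if s[0] / s[-1] <= threshold:
        return W                          # well-conditioned: do nothing
    W = W / s[0]                          # spectral normalization first
    for _ in range(max_steps):            # low-degree polynomial preconditioning
        W = 1.5 * W - 0.5 * W @ W.T @ W   # Newton-Schulz cubic step
    return W

rng = np.random.default_rng(2)
W = rng.normal(size=(16, 16))
W_pc = adaptive_precondition(W)           # condition number never increases
```

Gating the preconditioner on the observed conditioning is what keeps APC from over-restricting the expressivity of well-behaved layers.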

1.1. RELATED WORK

Most closely related to the proposed method is the work of Jiang et al. (2019), which also controls the spectrum in GAN training. They re-parameterize a weight matrix W as W = USVᵀ, adding orthogonal regularization of U and V and a certain regularizer on the entries of the diagonal matrix S. This approach differs from ours in a few aspects. First, Jiang et al. (2019) essentially solve a constrained optimization problem with constraints UᵀU = I and VᵀV = I using a penalty method (Bertsekas, 1997). In contrast, our approach solves an unconstrained problem, since we add one layer to the neural net, similar to batch normalization (BN) (Ioffe & Szegedy, 2015) and SN (Miyato et al., 2018). Second, our PC-layer is a direct generalization of SN, as it includes the SN layer as a special case; the method of Jiang et al. (2019), in contrast, never reduces to the SN layer. Our proposed method thus offers a smoother transition for existing users of SN. In a broader context, a number of approaches have been proposed to stabilize and improve GAN training, such as modifying the loss function (Arjovsky et al., 2017; Arjovsky & Bottou, 2017; Mao et al., 2017; Li et al., 2017b; Deshpande et al., 2018), normalization and regularization (Gulrajani et al., 2017; Miyato et al., 2018), progressive growing techniques (Karras et al., 2018; Huang et al., 2017), changing the architecture (Zhang et al., 2019; Karnewar & Wang, 2019), and utilizing side information such as class labels (Mirza & Osindero, 2014; Odena et al., 2017; Miyato & Koyama, 2018). Under this taxonomy, our approach fits the "normalization and regularization" category (even though our method is not exactly normalization, the essence of "embedded control" is similar). Note that these directions are relatively orthogonal, and our approach can potentially be combined with other techniques such as progressive growing.
However, due to limited computational resources, we focus on unconditional GANs with classical architectures, the setting studied by Miyato et al. (2018).
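For illustration, the orthogonality constraints in the penalty-method reading of Jiang et al. (2019) can be encoded as a differentiable penalty term; the sketch below is a generic squared-Frobenius surrogate, not their exact regularizer:

```python
import numpy as np

def orth_penalty(M):
    """Penalty-method surrogate for the constraint M^T M = I:
    the squared Frobenius norm ||M^T M - I||_F^2."""
    k = M.shape[1]
    return float(np.linalg.norm(M.T @ M - np.eye(k), 'fro') ** 2)

# An exactly orthogonal matrix incurs (near-)zero penalty.
Q, _ = np.linalg.qr(np.random.default_rng(3).normal(size=(8, 8)))
assert orth_penalty(Q) < 1e-12
```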

1.2. NOTATION AND DEFINITION

We use eig(A) to denote the multiset (i.e., allowing repetition) of all eigenvalues of A. If all eigenvalues of A are non-negative real numbers, we say that A is a positive semidefinite (PSD) matrix. The singular values of a matrix A ∈ ℝ^{n×m} are the square roots of the eigenvalues of AᵀA ∈ ℝ^{m×m}. Let σ_max(A) and σ_min(A) denote the maximum and minimum singular values of A. Let ‖A‖₂ denote the spectral norm of A, which equals σ_max(A).
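These definitions can be verified numerically; a small numpy check:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(5, 3))

s = np.linalg.svd(A, compute_uv=False)       # singular values of A, descending
lam = np.linalg.eigvalsh(A.T @ A)            # eigenvalues of the PSD matrix A^T A, ascending

# Singular values are the square roots of eig(A^T A).
assert np.allclose(np.sort(s), np.sqrt(np.maximum(lam, 0.0)))
# The spectral norm ||A||_2 equals sigma_max(A).
assert np.isclose(np.linalg.norm(A, 2), s[0])
```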



Figure 1: Evolution of the 5 smallest singular values of (a) SN-GAN (FID 147.9) and (b) APC-GAN (FID 34.08) when generating STL-10 images with a ResNet trained with D_it = 1 for 200k iterations. The maximum singular value is around 1 due to SN and is thus not shown.

