PRECONDITION LAYER AND ITS USE FOR GANS

Abstract

One of the major challenges when training generative adversarial nets (GANs) is instability. Spectral normalization (SN) is remarkably successful at addressing this instability. However, SN-GAN still suffers from training instabilities, especially when working with higher-dimensional data. We find that these instabilities are accompanied by large condition numbers of the discriminator weight matrices. Following common linear-algebra practice, we improve training stability via preconditioning. Specifically, we introduce a preconditioning layer (PC-layer) that performs low-degree polynomial preconditioning. We use this PC-layer in two ways: 1) fixed preconditioning (FPC) adds a fixed PC-layer to all layers; and 2) adaptive preconditioning (APC) adaptively controls the strength of preconditioning. Empirically, we show that FPC and APC stabilize the training of unconditional GANs using classical architectures. On LSUN 256 × 256 data, APC improves FID scores by around 5 points over baselines.

1. INTRODUCTION

Generative Adversarial Nets (GANs) (Goodfellow et al., 2014) successfully transform samples from one distribution to another. Nevertheless, training GANs is known to be challenging, and performance is often sensitive to hyper-parameters and datasets. Understanding the training difficulties of GANs is thus an important problem. Recent studies in neural network theory (Pennington et al., 2017; Xiao et al., 2018; 2020) suggest that the spectrum of the input-output Jacobian or the neural tangent kernel (NTK) is an important metric for understanding training performance. While directly manipulating the spectrum of the Jacobian or NTK is not easy, a practical alternative is to manipulate the spectrum of the weight matrices, e.g., via orthogonal initialization (Xiao et al., 2018). For a special neural net, Hu et al. (2020) showed that orthogonal initialization leads to better convergence than Gaussian initialization, which provides early theoretical evidence for the importance of manipulating the weight-matrix spectrum. Motivated by these studies, we suspect that an "adequate" weight-matrix spectrum is also important for GAN training. Indeed, one of the most popular techniques for GAN training, spectral normalization (SN) (Miyato et al., 2018), manipulates the spectrum by scaling all singular values by a constant, which ensures that the spectral norm is upper bounded. However, we find that for some hyperparameters and for high-resolution datasets, SN-GAN fails to generate good images. In a failure-case study, we find that the condition numbers of the weight matrices become very large and that the majority of the singular values are close to 0 during training; see Fig. 1(a) and Fig. 2(a). This can happen because SN does not promote a small condition number. This finding motivates us to reduce the condition numbers of the weights during GAN training. Recall that controlling the condition number is also a central problem in numerical linear algebra, known as preconditioning (see Chen (2005)).
We hence seek to develop a "plug-in" preconditioner for the weights. This requires the preconditioner to be differentiable. Among the various preconditioners, we find the polynomial preconditioner to be a suitable choice due to its simple differentiation and strong theoretical support from approximation theory. Further, we propose adaptively adjusting the strength of the preconditioner during training so as not to overly restrict expressivity. We show the efficacy of preconditioning on CIFAR10 (32 × 32), STL (48 × 48), and LSUN bedroom, tower and living room (256 × 256).

Summary of contributions.

For a deep linear network studied in Hu et al. (2020), we prove that if all weight matrices have a bounded spectrum, then gradient descent converges to the global minimum at a geometric rate. We then introduce a PC-layer (preconditioning layer) that consists of a low-degree polynomial preconditioner. We further study adaptive preconditioning (APC), which adaptively controls the strength of PC on different layers in different iterations. Applying FPC and APC to unconditional GAN training on LSUN data (256 × 256) generates high-quality images where SN-GAN fails. We also show that APC achieves better FID scores on CIFAR10, STL, and LSUN than the recently proposed method of Jiang et al. (2019).

1.1. RELATED WORK

Related to the proposed method is the work by Jiang et al. (2019), which also controls the spectrum in GAN training. They re-parameterize a weight matrix $W$ via $W = USV^T$, adding orthogonal regularization of $U, V$ and a certain regularizer of the entries of the diagonal matrix $S$. This approach differs from ours in a few aspects. First, Jiang et al. (2019) essentially solve a constrained optimization problem with constraints $U^TU = I$, $V^TV = I$ using a penalty method (Bertsekas, 1997). In contrast, our approach solves an unconstrained problem, since we add one layer to the neural net, similar to batch normalization (BN) (Ioffe & Szegedy, 2015) and SN (Miyato et al., 2018). Second, our PC-layer is a direct generalization of SN, as it includes the SN-layer as a special case; in contrast, the method of Jiang et al. (2019) always differs from the SN-layer. Our proposed method thus offers a smoother transition for existing users of SN. In a broader context, a number of approaches have been proposed to stabilize and improve GAN training, such as modifying the loss function (Arjovsky et al., 2017; Arjovsky & Bottou, 2017; Mao et al., 2017; Li et al., 2017b; Deshpande et al., 2018), normalization and regularization (Gulrajani et al., 2017; Miyato et al., 2018), progressive growing techniques (Karras et al., 2018; Huang et al., 2017), changing the architecture (Zhang et al., 2019; Karnewar & Wang, 2019), and utilizing side information like class labels (Mirza & Osindero, 2014; Odena et al., 2017; Miyato & Koyama, 2018). Within this taxonomy, our approach fits the "normalization and regularization" category (even though our method is not exactly normalization, the essence of "embedded control" is similar). Note that these directions are relatively orthogonal, and our approach can potentially be combined with other techniques such as progressive growing.
However, due to limited computational resources, we focus on unconditional GANs using classical architectures, the setting studied by Miyato et al. (2018) .

1.2. NOTATION AND DEFINITION

We use $\mathrm{eig}(A)$ to denote the multiset (i.e., allowing repetition) of all eigenvalues of $A$. If all eigenvalues of $A$ are non-negative real numbers, we say $A$ is a positive semidefinite (PSD) matrix. The singular values of a matrix $A \in \mathbb{R}^{n \times m}$ are the square roots of the eigenvalues of $A^TA \in \mathbb{R}^{m \times m}$. Let $\sigma_{\max}(A)$ and $\sigma_{\min}(A)$ denote the maximum and minimum singular values of $A$. Let $\|A\|_2$ denote the spectral norm of $A$, i.e., the largest singular value. The condition number of a square matrix $A$ is traditionally defined as $\kappa(A) = \|A\|_2 \|A^{-1}\|_2 = \sigma_{\max}(A)/\sigma_{\min}(A)$. We extend this definition to a rectangular matrix $A \in \mathbb{R}^{n \times m}$ with $n \ge m$ via $\kappa(A) = \sigma_{\max}(A)/\sigma_{\min}(A)$. Let $\deg(p)$ denote the degree of a polynomial $p$ and let $P_k = \{p \mid \deg(p) \le k\}$ be the set of polynomials of degree at most $k$.
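As a concrete illustration of the extended definition, the condition number of a rectangular matrix can be computed from its singular values. A minimal NumPy sketch (the function name `cond_rect` is ours, not from the paper):

```python
import numpy as np

def cond_rect(A):
    """kappa(A) = sigma_max(A) / sigma_min(A) for A in R^{n x m}, n >= m."""
    s = np.linalg.svd(A, compute_uv=False)  # singular values, descending order
    return s[0] / s[-1]

# A matrix with orthonormal columns has all singular values equal to 1,
# hence condition number 1.
Q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(6, 4)))
print(cond_rect(Q))                    # ~1.0
print(cond_rect(np.diag([3.0, 1.0])))  # 3.0
```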

2. WHY CONTROLLING THE SPECTRUM?

To understand why controlling the spectrum is helpful, we leverage recent tools in neural network theory to prove the following result: if weight matrices have small condition numbers, then gradient descent for deep pyramid linear networks converges to the global minimum fast. This is inspired by Hu et al. (2020), who analyze a deep linear network to justify orthogonal initialization. Similar to Hu et al. (2020), we consider a linear network that takes an input $x \in \mathbb{R}^{d_x \times 1}$ and outputs $F(\theta; x) = W_L W_{L-1} \cdots W_1 x \in \mathbb{R}^{d_y \times 1}$, where $\theta = (W_1, \ldots, W_L)$ collects all parameters and $W_j$ is a matrix of dimension $d_j \times d_{j-1}$, $j = 1, \ldots, L$. Here we define $d_0 = d_x$ and $d_L = d_y$. Assume there exists $r \in \{1, \ldots, L\}$ such that $d_y = d_L \le d_{L-1} \le \cdots \le d_r$ and $n \le d_0 \le d_1 \le \cdots \le d_r$. This means the network is a pyramid network, which generalizes the equal-width network of Hu et al. (2020). Suppose $y = (y_1; \ldots; y_n) \in \mathbb{R}^{n d_y \times 1}$ are the labels, and the predictions are $F(\theta; X) = (F(\theta; x_1); \ldots; F(\theta; x_n)) \in \mathbb{R}^{n d_y \times 1}$. We consider a quadratic loss $L(\theta) = \frac{1}{2}\|y - F(\theta; X)\|^2$. Starting from $\theta(0)$, we generate $\theta(k) = (W_1(k), \ldots, W_L(k))$, $k = 1, 2, \ldots$ via gradient descent: $\theta(k+1) = \theta(k) - \eta \nabla L(\theta(k))$. Denote the residual $e(k) = F(\theta(k); X) - y$. For given $\tau_l \ge 1$, $\mu_l \ge 0$, $l = 1, \ldots, L$, define
$$R \triangleq \{\theta = (W_1, \ldots, W_L) \mid \tau_l \ge \sigma_{\max}(W_l) \ge \sigma_{\min}(W_l) \ge \mu_l,\ \forall l\},$$
$$\rho \triangleq L \|X\|^2 \tau_L \cdots \tau_1 \big(\|e(0)\| + \|X\|_F \tau_L \cdots \tau_1\big), \qquad \mu \triangleq (\mu_1 \cdots \mu_L)^2 \sigma_{\min}(X)^2.$$
The following result states that if $\theta(k)$ stays within the region $R$ (i.e., the weight matrices have bounded spectrum) for $k = 0, 1, \ldots, K$, then the loss decreases at a geometric rate until iteration $K$. The rate $(1 - \mu/\rho)$ depends on $(\tau_L \cdots \tau_1)^2 / (\mu_L \cdots \mu_1)^2$, which is related to the condition numbers of all weights.

Theorem 1. Suppose $\eta = 1/\rho$. Assume $\theta(k) \in R$, $k = 0, 1, \ldots, K$. Then we have $\|e(k+1)\|^2 \le (1 - \mu/\rho)\|e(k)\|^2$, $k = 0, 1, \ldots, K$.
See Appendix D.3.1 for the proof and a detailed discussion. For a proper initial point $\theta(0)$ where the $W_l(0)$ are full-rank, we can always pick $\tau_l, \mu_l$ so that $\theta(0) \in R$. The trajectory $\{\theta(k)\}$ either stays in $R$ forever (in which case $K = \infty$), or leaves $R$ at some finite iteration $K$. In the former case, the loss converges to zero at a geometric rate; in the latter case, the loss decreases to below $(1 - \mu/\rho)^K \|e(0)\|^2$. However, our theorem does not specify how large $K$ is in a given situation. Previous works on convergence (e.g., Hu et al., 2020; Du et al., 2018; Allen-Zhu et al., 2019; Zou et al., 2018) bound the movement of the weights under extra assumptions, so that the trajectory stays in a certain nice regime (related to $R$). We do not attempt to prove that the trajectory stays in $R$. Instead, we use this as a motivation for algorithm design: if we can improve the condition numbers of the weights during training, then the trajectory may stay in $R$ for a longer time, and thus lead to smaller loss values. Next, we present the preconditioning layer as such a method.
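The mechanism behind Theorem 1 can be checked numerically on a tiny example: with orthogonal initialization the weights start with all singular values equal to 1, and while they stay well conditioned, the gradient-descent residual shrinks geometrically. A minimal two-layer sketch (dimensions, step size, and the well-conditioned target are illustrative choices of ours, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
# Well-conditioned target: singular values in [0.5, 1.5].
U, _ = np.linalg.qr(rng.normal(size=(d, d)))
V, _ = np.linalg.qr(rng.normal(size=(d, d)))
Y = U @ np.diag(rng.uniform(0.5, 1.5, size=d)) @ V.T

# Two-layer linear net F(theta) = W2 W1 (take X = I), orthogonal init,
# so both weights start with all singular values equal to 1.
W1 = np.linalg.qr(rng.normal(size=(d, d)))[0]
W2 = np.linalg.qr(rng.normal(size=(d, d)))[0]

eta, hist = 0.05, []
for _ in range(400):
    E = W2 @ W1 - Y                    # residual e(k)
    hist.append(0.5 * np.sum(E ** 2))  # quadratic loss L(theta(k))
    # Gradient steps; the tuple assignment uses the pre-update weights on both sides.
    W1, W2 = W1 - eta * W2.T @ E, W2 - eta * E @ W1.T

kappas = [np.linalg.cond(W) for W in (W1, W2)]
print(hist[0], hist[-1], kappas)       # loss drops by orders of magnitude
```

Because the target is well conditioned, the weights remain well conditioned throughout, matching the regime $R$ in which the theorem guarantees geometric decrease.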

3. PRECONDITIONING LAYER

In the following, we first introduce classical polynomial preconditioners in Sec. 3.1. We then present the preconditioning layer for deep nets in Sec. 3.2. We explain how to compute a preconditioning polynomial afterwards in Sec. 3.3, and finally present an adaptive preconditioning in Sec. 3.4.

3.1. PRELIMINARY: POLYNOMIAL PRECONDITIONER

Preconditioning considers the following classical question: for a symmetric matrix $Q$, how do we find an operator $\hat g$ such that $\kappa(\hat g(Q))$ is small? Due to the importance and wide applicability of this question, there is a huge literature on preconditioning; see, e.g., Chen (2005) for an overview, and Appendix B for a short introduction. In this work, we focus on polynomial preconditioners (Johnson et al., 1983). The goal is to find a polynomial $p$ such that $p(Q)Q$ has a small condition number. The matrix $p(Q)$ is often called the preconditioner, and $\hat g(Q) \triangleq p(Q)Q$ is the preconditioned matrix. We call $\hat g$ the preconditioning polynomial. Polynomial preconditioning has a special merit: the difficult problem of manipulating eigenvalues can be transformed into manipulating a 1-d function, based on the following fact (proof in Appendix E.2.1).

Claim 3.1. Suppose $\hat g$ is any polynomial, and $Q \in \mathbb{R}^{m \times m}$ is a real symmetric matrix with eigenvalues $\lambda_1 \le \cdots \le \lambda_m$. Then the eigenvalues of the matrix $\hat g(Q)$ are $\hat g(\lambda_i)$, $i = 1, \ldots, m$. As a corollary, if $\hat g([\lambda_1, \lambda_m]) \subseteq [1-\epsilon, 1]$, then $\mathrm{eig}(\hat g(Q)) \subseteq [1-\epsilon, 1]$.

To find a matrix $\hat g(Q) = p(Q)Q$ that is well conditioned, we need to find a polynomial $p$ such that $\hat g(\lambda) = p(\lambda)\lambda$ maps $[\lambda_1, \lambda_m]$ into $[1-\epsilon, 1]$. This can be formulated as a function approximation problem: find a polynomial $\hat g(\lambda)$ of the form $\lambda p(\lambda)$ that approximates a function $f(\lambda)$ on $\lambda \in [\lambda_1, \lambda_m]$. Under some criteria, the optimal polynomial is a variant of the Chebyshev polynomial, and the solutions for more general criteria are also well understood; see Appendix B.1 for more. A scaling trick is commonly used in practice. It reduces the problem of preconditioning $Q$ to the problem of preconditioning a scaled matrix $Q_{\mathrm{sca}} = Q/\lambda_m$. Scaling employs two steps: first, we find a polynomial $g$ that approximates $f(x) = 1$ on $x \in [\lambda_1/\lambda_m, 1]$; second, we set $\hat g(\lambda) = g(\lambda/\lambda_m)$ and use $\hat g(Q) = g(Q/\lambda_m) = g(Q_{\mathrm{sca}})$ as the final preconditioned matrix. It is easy to verify that $\hat g$ approximates $f(\lambda) = 1$ on $[\lambda_1, \lambda_m]$.
Thus this approach is essentially identical to solving the approximation problem on $[\lambda_1, \lambda_m]$. Johnson et al. (1983) use this trick mainly to simplify notation, since they can assume $\lambda_m = 1$ without loss of generality. We will use this scaling trick for a different purpose (see Section 3.3).
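As a toy illustration of these ideas (our own example, not a polynomial from the paper): after the scaling trick puts the eigenvalues of a PSD matrix into $(0, 1]$, even the degree-1 preconditioner $p(\lambda) = 2 - \lambda$, i.e., $\hat g(\lambda) = \lambda p(\lambda) = 1 - (1-\lambda)^2$, pulls every eigenvalue toward 1 and shrinks the condition number by a factor $1/(2 - \lambda_{\min})$:

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(8, 8))
Q = B @ B.T                          # symmetric PSD
Q = Q / np.linalg.eigvalsh(Q).max()  # scaling trick: eigenvalues now in (0, 1]

P = 2.0 * np.eye(8) - Q              # degree-1 preconditioner p(Q) = 2I - Q
G = P @ Q                            # ghat(Q) = p(Q) Q = 2Q - Q^2

def kappa(M):
    w = np.linalg.eigvalsh(M)
    return w.max() / w.min()

print(kappa(Q), kappa(G))            # condition number strictly decreases
```

Higher-degree polynomials (e.g., Chebyshev variants) push the spectrum toward $[1-\epsilon, 1]$ more aggressively, at the cost of more matrix products.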

3.2. PRECONDITIONING LAYER IN DEEP NETS

Suppose $D(W_1, \ldots, W_L)$ is a deep net parameterized by weights $W_1, \ldots, W_L$ for layers $l \in \{1, \ldots, L\}$. To control the spectrum of a weight $W_l$, we want to embed a preconditioner into the neural net. Among various preconditioners, polynomial ones are appealing since their gradient is simple and permits natural integration with backpropagation. We therefore present a preconditioning layer (PC-layer) as follows: a PC-layer $\hat g(W) = g(\mathrm{SN}(W))$ is the concatenation of a preconditioning polynomial $g$ and the SN operation of Miyato et al. (2018) (see the appendix section on details of FPC and APC for the definition of $\mathrm{SN}(W)$). The SN operator is used as a scaling operator (the reason is explained later). We describe an efficient implementation of the PC-layer in Appendix C.3. In our case, we use $A = \mathrm{SN}(W)$ to denote the scaled matrix. Prior work on polynomial preconditioners (Johnson et al., 1983; Chen, 2005) often studies square matrices; to handle rectangular matrices, some modifications are needed. A naïve solution is to apply a preconditioner to the symmetrized matrix $A^TA$, leading to the matrix $p(A^TA)A^TA$. This solution works for linear models (see Appendix B.2 for details), but it is not appropriate for deep nets, since the shape of $p(A^TA)A^TA \in \mathbb{R}^{m \times m}$ differs from that of $A$. To maintain the shape $n \times m$, we propose to transform $A$ into $g(A) = p(AA^T)A \in \mathbb{R}^{n \times m}$. This transformation works for general parameterized models, including linear models and neural nets. For a detailed comparison of the two approaches, see Appendix B.2. The following claim relates the spectra of $A$ and $p(AA^T)A$; see the proof in Appendix E.2.2.

Claim 3.2. Suppose $A \in \mathbb{R}^{n \times m}$ has singular values $\sigma_1 \le \cdots \le \sigma_m$, and suppose $g(x) = p(x^2)x$ where $p$ is a polynomial. Then the singular values of $g(A) = p(AA^T)A$ are $|g(\sigma_1)|, \ldots, |g(\sigma_m)|$.

To find a polynomial $p$ such that $g(A) = p(AA^T)A$ is well conditioned, we thus need the polynomial $g(x) = p(x^2)x$ to map $[\sigma_1, \sigma_m]$ into $[1-\epsilon, 1]$ for some $\epsilon$. This can again be formulated as a function approximation problem: find a polynomial $g \in G_k = \{g(x) = xp(x^2) \mid p \in P_k\}$ that approximates the constant function $f(\sigma) = 1$ on $\sigma \in [\sigma_1, \sigma_m]$. We describe the algorithm for finding the preconditioning polynomial $g$ in Sec. 3.3. In principle, the PC-layer can be added to any deep net, for supervised learning as well as for GANs. Here we focus on GANs for the following reason: current algorithms for supervised learning already work quite well, diminishing the effect of preconditioning, whereas for GANs there is a lot of room to improve training. Following SN-GAN, which applies SN to the discriminator, in the experiments we apply PC to the discriminator.
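Claim 3.2 is easy to verify numerically. The sketch below (our own, with exact SVD-based scaling standing in for the power-iteration SN operator) applies $g(A) = p(AA^T)A$ with $p(t) = 2 - t$, i.e., $g(x) = 2x - x^3$, and compares the singular values of $g(A)$ with $|g(\sigma_i)|$:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))
A = W / np.linalg.svd(W, compute_uv=False)[0]  # exact stand-in for SN(W): ||A||_2 = 1

# PC-layer with p(t) = 2 - t, so g(x) = x * p(x^2) = 2x - x^3.
G = (2.0 * np.eye(5) - A @ A.T) @ A            # p(A A^T) A, same shape as A

s = np.linalg.svd(A, compute_uv=False)         # singular values of A
g = lambda x: 2 * x - x ** 3
predicted = np.sort(np.abs(g(s)))
actual = np.sort(np.linalg.svd(G, compute_uv=False))
print(predicted, actual)                       # the two lists coincide
```

Note that $g(x) \ge x$ on $[0, 1]$, so this particular choice pushes every singular value of the scaled matrix upward, which is exactly the conditioning effect the PC-layer is after.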

3.3. FINDING PRECONDITIONING POLYNOMIALS

In this subsection, we discuss how to generate preconditioning polynomials. This generation is done offline and is independent of training. We present the optimization formulation and discuss the choice of a few hyperparameters, such as the desirable range and the target function $f$.

Optimization formulation. Suppose we are given a range $[\lambda_L, \lambda_U]$, a target function $f$, and an integer $k$; the specific choices are discussed later. We want to solve the following approximation problem: find the best polynomial of the form $g(x) = x(a_0 + a_1 x^2 + \cdots + a_k x^{2k})$ that approximates $f(x)$ on the domain $[\lambda_L, \lambda_U]$, i.e., solve $\min_{g \in G_k} d_{[\lambda_L, \lambda_U]}(g, f)$, where $G_k = \{g(x) = xp(x^2) \mid p \in P_k\}$ and $d_{[\lambda_L, \lambda_U]}$ is a distance metric on the function space $C[\lambda_L, \lambda_U]$, such as the $\ell_\infty$ distance $d_{[\lambda_L, \lambda_U]}(f, g) = \max_{t \in [\lambda_L, \lambda_U]} |f(t) - g(t)|$. We consider a weighted least-squares problem suggested by Johnson et al. (1983): $\min_{g \in G_k} \int_{\lambda_L}^{\lambda_U} |g(x) - f(x)|^2 w(x)\,dx$, where $w(x) = x^\alpha$ is the weight function used in Johnson et al. (1983). We discretize the objective and solve the finite-sample version as follows: $\min_{c = (c_0, c_1, \ldots, c_k) \in \mathbb{R}^{k+1}} \sum_{i=1}^n \big(x_i \sum_{t=0}^k c_t x_i^{2t} - f(x_i)\big)^2 w(x_i)$, where $x_i \in [\lambda_L, \lambda_U]$ for all $i$ (e.g., drawn from the uniform distribution on $[\lambda_L, \lambda_U]$). This is a weighted least-squares problem (thus convex) that can easily be solved by a standard solver.

Choice of desirable range $[\lambda_L, \lambda_U]$. The range within which we want to approximate the target function is often chosen as the convex hull of the singular values of the matrix to be preconditioned. For the original matrix $W$, the desirable range is $[\sigma_{\min}(W), \sigma_{\max}(W)]$. However, this range varies across layers and iterations. For this reason, we scale each $W$ by $1/\|W\|_2$ to obtain $A$, so that its singular values lie in a fixed range $[0, 1]$. Note that a more precise range is $[\sigma_{\min}(A)/\sigma_{\max}(A), 1]$, but we relax it to $[0, 1]$. We follow Miyato et al. (2018) and use one power iteration to estimate the spectral norm, $\hat\sigma_W \approx \|W\|_2$, and denote $\mathrm{SN}(W) = W/\hat\sigma_W$. Since $\hat\sigma_W$ is not exactly $\|W\|_2$, the singular values of $A = \mathrm{SN}(W)$ may not lie exactly in $[0, 1]$. We checked the empirical estimates and found that the estimated spectral norm during SN-GAN training is often less than 1.1 times the true spectral norm (see Fig. 5 in Appendix E.1); we therefore pick $[\lambda_L, \lambda_U] = [0, 1.1]$ in our implementation.

Choice of target function $f$. Previously, we discussed the ideal situation $[\lambda_L, \lambda_U] = [\sigma_1, \sigma_m]$, for which the target function is the constant 1. With the desirable range relaxed to $[0, \lambda_U]$, we cannot set $f(x) = 1$ on all of $[0, \lambda_U]$: any polynomial $g \in G_k$ must satisfy $g(0) = 0$, causing a large approximation error at $x = 0$. We therefore set $f(0) = 0$. While setting all singular values to 1 is ideal for fast training, it may reduce the expressiveness of deep nets. More specifically, the set of functions $\{D(W_1, \ldots, W_L) \mid \mathrm{eig}(W_l^TW_l) \subseteq \{1\}\}$ is smaller than $\{D(W_1, \ldots, W_L) \mid \mathrm{eig}(W_l^TW_l) \subseteq [\sigma_0, 1]\}$; forcing all singular values to 1 may thus hurt the representation power. Therefore, we do not want the target function to equal 1 on all of $[\sigma_{\min}(A), \lambda_U]$. In practice, the value of $\sigma_{\min}(A)$ varies across problems, so we permit a flexible target function $f$, to be chosen by the user. In our implementation, we restrict target functions to a family of piecewise-linear functions $\mathrm{PL}_b(x)$ with a relatively large cutoff point $b$, such as 0.8 or 0.3. We plot our candidate target functions $\mathrm{PL}_{0.3}$, $\mathrm{PL}_{0.4}$, $\mathrm{PL}_{0.6}$ and $\mathrm{PL}_{0.8}$ in Figure 3.
As the cutoff point $b$ decreases from 1 to 0, the function $\mathrm{PL}_b$ becomes more aggressive, as it pushes more singular values toward 1. As a result, optimization will likely become easier, while the representation power becomes weaker. The exact choice of the target function is likely problem-dependent, and we discuss two strategies to select it in Section 3.4.

Search space of preconditioning polynomials. As mentioned earlier, the default search space is $G_k = \{g(x) = xp(x^2) \mid p \in P_k\}$ for a pre-fixed $k$. The degree of $g$ is an important hyperparameter: on the one hand, the higher the degree, the better the polynomial can fit the target function $f$; on the other hand, a higher degree leads to more computation. In our implementation, we consider $k = 1, 2, 3, 4$, i.e., polynomials of degree 3, 5, 7 and 9. The extra time is relatively small; see Section C.4 for details.
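The fitting step above reduces to an ordinary weighted least-squares solve. A self-contained sketch (the grid size, $\alpha = 1$, and the use of `np.linalg.lstsq` are our own choices; the paper's exact setup may differ):

```python
import numpy as np

def PL(b):
    """Piecewise-linear target PL_b: x/b below the cutoff b, then 1."""
    return lambda x: np.minimum(x / b, 1.0)

def fit_pc_poly(f, k, lam_lo=0.0, lam_hi=1.1, n=2000, alpha=1.0):
    """Fit g(x) = x*(c0 + c1 x^2 + ... + ck x^{2k}) to f on [lam_lo, lam_hi]
    by least squares with weight w(x) = x^alpha."""
    x = np.linspace(lam_lo, lam_hi, n + 1)[1:]  # drop x = 0 (w(0) = 0, g(0) = 0)
    Phi = np.stack([x ** (2 * t + 1) for t in range(k + 1)], axis=1)
    sw = np.sqrt(x ** alpha)                    # sqrt-weights for the lstsq form
    c, *_ = np.linalg.lstsq(Phi * sw[:, None], f(x) * sw, rcond=None)
    return c

c = fit_pc_poly(PL(0.8), k=1)                   # degree-3 fit for PL_0.8
g = lambda x: sum(ci * x ** (2 * t + 1) for t, ci in enumerate(c))
xs = np.linspace(0.0, 1.1, 111)
print(c, np.abs(g(xs) - PL(0.8)(xs)).max())     # modest uniform error
```

By construction $g(0) = 0$, matching the constraint that every member of $G_k$ vanishes at the origin; the residual error for the (degree 3, $b = 0.8$) pair is modest, consistent with the $b_3 \approx 0.8$ pairing discussed in Section 3.4.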

3.4. FIXED PRECONDITIONING AND ADAPTIVE PRECONDITIONING

A preconditioning polynomial is determined by the target function and the degree $k$. Which polynomial shall we use during training?

Candidate preconditioners. At first sight, there are two free hyperparameters $b$ and $k$. Nevertheless, if $b$ is small (steep slope), then $\mathrm{PL}_b$ is hard to approximate by low-order polynomials. For each degree $d \in \{3, 5, 7, 9\}$, there is a certain threshold $b_d$ such that $b < b_d$ leads to a large approximation error. We find $b_3 \approx 0.8$, $b_5 \approx 0.6$, $b_7 \approx 0.4$, $b_9 \approx 0.3$. After fitting $\mathrm{PL}_{0.3}$, $\mathrm{PL}_{0.4}$, $\mathrm{PL}_{0.6}$ and $\mathrm{PL}_{0.8}$, the resulting polynomials are shown in Figure 3. A natural approach is to add the PC-layer to all layers of the neural net, resulting in the preconditioned net $D_{\mathrm{PC}}(\theta) = D(g(\mathrm{SN}(W_1)), \ldots, g(\mathrm{SN}(W_L)))$. We call this method fixed preconditioning (FPC). As with other hyperparameters, in practice we can try various preconditioners and pick the one with the best performance; not surprisingly, the best preconditioner varies across datasets.

Adaptive preconditioning (APC). Motivated by adaptive learning-rate schemes like Adam (Kingma & Ba, 2014) and LARS (You et al., 2017), we propose an adaptive preconditioning scheme. APC applies the preconditioner in an epoch-adaptive and layer-adaptive manner: at each epoch and for each layer, the algorithm automatically picks a proper preconditioner based on the current condition number. The standard condition number $\kappa(A) = \sigma_{\max}(A)/\sigma_{\min}(A)$ is not necessarily a good indicator of optimization performance. In APC, we use a modified condition number $\bar\kappa(A) = \sigma_{\max}(A) / \big(\frac{1}{m'}\sum_{i=1}^{m'} \sigma_i(A)\big)$, where $A$ has $m$ columns, $m' = \lceil m/10 \rceil$, and $\sigma_1(A), \ldots, \sigma_{m'}(A)$ are the $m'$ smallest singular values. We prepare $r$ preconditioning polynomials $g_1, \ldots, g_r$ with different strengths (e.g., the four polynomials $g_1, g_2, g_3, g_4$ shown in Figure 3). We set up the ranges $[\tau_0, \tau_1], [\tau_1, \tau_2], \ldots, [\tau_r, \tau_{r+1}]$ with $\tau_0 = 0$ and $\tau_{r+1} = \infty$. If the modified condition number of $A$ falls into the range $[\tau_i, \tau_{i+1}]$ for $i \in \{0, 1, \ldots, r\}$, we use $g_i$ in the PC-layer. In our implementation, we set $r = 4$. To save computation, we only compute the modified condition number and update the PC strength at a fixed interval (e.g., every 1000 iterations). APC is summarized in Table 3 in Appendix C.2.

Computation time. We use a few implementation tricks; see Appendix C.3. In our implementation of FPC with a degree 3, 5, 7 or 9 polynomial, the added time is around 20-30% of the original training time (Fig. 4(a)), while the extra time of APC over SN is often less than 10% (Fig. 4(b)). See Appendix C.4 for more on the computation time.
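The APC selection rule can be sketched as follows. The thresholds $\tau_i$ below are illustrative placeholders of ours (the paper does not list specific values here), and reading $\bar\kappa$ as averaging the $\lceil m/10 \rceil$ smallest singular values is our interpretation of the definition:

```python
import numpy as np

def modified_cond(A):
    """kappa_bar(A) = sigma_max(A) / mean of the ceil(m/10) smallest singular values."""
    s = np.linalg.svd(A, compute_uv=False)  # descending order
    m0 = -(-A.shape[1] // 10)               # ceil(m / 10)
    return s[0] / s[-m0:].mean()

def pick_preconditioner(A, taus=(3.0, 10.0, 30.0, 100.0)):
    """Return the index i of the polynomial g_i to apply: kappa_bar in
    [tau_i, tau_{i+1}) selects g_i, with g_0 weakest and g_4 strongest."""
    return int(np.searchsorted(taus, modified_cond(A)))

well = np.eye(20)                            # kappa_bar = 1  -> weakest PC
ill = np.diag(np.logspace(0, -6, 20))        # kappa_bar huge -> strongest PC
print(pick_preconditioner(well), pick_preconditioner(ill))
```

Averaging the bottom tenth of the spectrum makes the indicator robust to a single stray tiny singular value, which is presumably why $\bar\kappa$ is preferred over $\kappa$ here.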

4. EXPERIMENTAL RESULTS

We will demonstrate the following two findings. First, SN-GAN still suffers from training instabilities, and its failure cases are accompanied by large condition numbers. Second, PC-layers can reduce the condition number and improve the final performance, especially for high-resolution data (LSUN 256 × 256). We conduct a set of experiments on unconditional image generation on CIFAR-10 (32 × 32), STL-10 (48 × 48), and LSUN bedroom, tower and living room (256 × 256). We also compare the condition numbers of the discriminator layers for different normalization methods to demonstrate the connection between the condition number and the performance. The following methods are used in our experiments: standard SN; SVD with D-Optimal Reg. (Jiang et al., 2019); FPC with degree 3 or 7 preconditioners; and APC. Following Miyato et al. (2018), we use the log-loss GAN on the CNN structure and the hinge-loss GAN on the ResNet structure.

CIFAR and STL: training failure of the (1,1)-update. Tuning a GAN is notoriously difficult and sensitive to hyper-parameters. Even for low-resolution images, without prior knowledge of good hyper-parameters such as $D_{it}$ and $G_{it}$ (the numbers of discriminator and generator updates per iteration), training a GAN is often not trivial. On CIFAR10, SN-GAN uses $D_{it} = 5$, $G_{it} = 1$ for ResNet; for brevity, we call this a (5,1)-update. However, using a (1,1)-update, i.e., changing $D_{it} = 5$ to $D_{it} = 1$ while keeping $G_{it} = 1$, leads to an SN-GAN training failure: a dramatic decrease in final performance and an FID score above 77. SN-GAN with the (1,1)-update also fails on STL data, yielding an FID score above 147. We are interested in stabilizing the (1,1)-update for two reasons: first, trainability under both the (1,1)-update and the (5,1)-update means improved training stability; second, the (1,1)-update requires only about 1/3 of the time of the (5,1)-update. Therefore, in the first experiment, we explore GAN training with the (1,1)-update on CIFAR-10 and STL-10.

Failure mode: large condition numbers.
Understanding the failure mode of training is often very useful for designing algorithms (e.g., Glorot & Bengio, 2010). We suspect that a large condition number is a failure mode of GAN training. As Table 1 shows, the high FID scores (bad cases) of SN-GAN are accompanied by large condition numbers.

PC reduces condition numbers and rectifies failures. Table 1 shows that FPC and APC both greatly improve training performance: in 200k iterations, they reduce FID from 77 to less than 20 on CIFAR-10, and from 147 to less than 34 on STL. The evolution of the 5 smallest singular values of the adaptively preconditioned matrices and of the condition numbers is shown in Fig. 1(b) and Fig. 2(b) for STL-10 training on ResNet with $D_{it} = 1$; PC-GAN successfully improves the spectrum of the weight matrices in this setting.

Experiments on the "good" case of SN-GAN. We report the results for the (5,1)-update on CIFAR-10 and STL-10 with ResNet in the Appendix; there, FPC and APC achieve similar or slightly better FID scores, and we also report IS scores. We also list the results of PC and multiple baselines on the CNN structure in the Appendix. Also note that the condition numbers in the failure case of SN-GAN are much higher than in the two normal cases of SN-GAN. In all cases, FPC and APC achieve significantly lower condition numbers than SN-GAN. APC yields higher condition numbers than FPC, yet better FID scores. We suspect that FPC over-controls the condition numbers, which lowers representation power; in contrast, APC strikes a better balance between representation and optimization. Generated image samples are presented in Appendix F.5.

5. CONCLUSION

We prove that for deep pyramid linear networks, if all weight matrices have bounded singular values throughout training, then gradient descent converges to the global minimum at a geometric rate. This result indicates that small weight-matrix condition numbers are helpful for training. We propose a preconditioning (PC) layer that improves weight-matrix condition numbers during training by leveraging tools from the polynomial preconditioning literature. It is differentiable and can thus be plugged into any neural net. We propose two methods to utilize the PC-layer: in FPC (fixed preconditioning), we add a fixed PC-layer to all layers; in APC (adaptive preconditioning), we add PC-layers whose preconditioning power depends on the current condition number. Empirically, we show that applying FPC and APC to GAN training generates good images in cases where SN-GAN performs badly, such as LSUN-bedroom 256 × 256 image generation.



Figure 1: Evolution of the 5 smallest singular values of (a) SN-GAN (FID 147.9) and (b) APC-GAN (FID 34.08) when generating STL-10 images with a ResNet trained with $D_{it} = 1$ for 200k iterations. The maximum singular value is around 1 due to SN and is thus not shown.


A candidate target function is the piecewise-linear function with cutoff point $b$, defined as $\mathrm{PL}_b(x) = x/b$ for $x < b$ and $\mathrm{PL}_b(x) = 1$ for $x \ge b$. If the cutoff point $b < \sigma_{\min}(A)$, then $\mathrm{PL}_b$ maps all singular values of $A$ to 1.

Figure 3: Left: different piecewise-linear target functions; right: the corresponding fitted preconditioning polynomials. The smaller the cutoff, the more aggressive the preconditioner; for instance, when the cutoff is 0.3, the preconditioner pushes all singular values above 0.3 to 1. We show fitted polynomials of degree 3, 5, 7, 9 for $\mathrm{PL}_{0.8}$, $\mathrm{PL}_{0.6}$, $\mathrm{PL}_{0.4}$, $\mathrm{PL}_{0.3}$, respectively. More details are in Sec. C.2.


High-resolution LSUN images. Using high-resolution data is more challenging. We present numerical results on LSUN bedroom (128 × 128 and 256 × 256), LSUN tower (256 × 256) and LSUN living room (256 × 256) in Table 2. The training time for one instance is 30 hours on a single RTX 2080 Ti (200k iterations). Note that SN-GAN is unstable and yields FID > 80 on LSUN-bedroom 256 × 256. The SVD method, our FPC, and our APC generate reasonable FID scores on all three datasets. Importantly, FPC is comparable to or better than SVD, and APC consistently outperforms the SVD method by 4-6 FID points in most cases.

Table 1: Comparison of SN-GAN and PC-GAN, using ResNet with $D_{it} = 1$. Here $\tilde W_l$ is the preconditioned weight matrix (i.e., after applying preconditioning). "2nd $\max_{l=1}^L \kappa(\tilde W_l)$" denotes the second largest condition number over all layers; "Avg $\kappa(\tilde W_l)$" denotes the average of all layer condition numbers.

Table 2: Results on LSUN data.

