DOUBLE DYNAMIC SPARSE TRAINING FOR GANS

Abstract

The past decade has witnessed a drastic increase in modern deep neural networks (DNNs) size, especially for generative adversarial networks (GANs). Since GANs usually suffer from high computational complexity, researchers have shown an increased interest in applying pruning methods to reduce the training and inference costs of GANs. Among different pruning methods invented for supervised learning, dynamic sparse training (DST) has gained increasing attention recently as it enjoys excellent training efficiency with comparable performance to post-hoc pruning. Hence, applying DST on GANs, where we train a sparse GAN with a fixed parameter count throughout training, seems to be a good candidate for reducing GAN training costs. However, a few challenges, including the degrading training instability, emerge due to the adversarial nature of GANs. Hence, we introduce a quantity called balance ratio (BR) to quantify the balance of the generator and the discriminator. We conduct a series of experiments to show the importance of BR in understanding sparse GAN training. Building upon single dynamic sparse training (SDST), where only the generator is adjusted during training, we propose double dynamic sparse training (DDST) to control the BR during GAN training. Empirically, DDST automatically determines the density of the discriminator and greatly boosts the performance of sparse GANs on multiple datasets.

1. INTRODUCTION

In the past decade, the training and inference costs of modern deep neural networks (DNNs) are gradually becoming prohibitive (He et al., 2016; Dosovitskiy et al., 2020; Liu et al., 2021d) , especially for large language models (Brown et al., 2020) . Among all these large models, generative adversarial networks (GANs) (Goodfellow et al., 2020) have been widely investigated for years and achieved remarkable results. However, similar to other giant models, GANs are notably computationally intensive. For example, BigGAN (Brock et al., 2018) trained on 8 NVIDIA V100 GPUs with full precision will take 15 days. Consequently, to train GANs in broader resource-constrained scenarios, this computational bottleneck of training needs to be resolved urgently. Neural network pruning has recently emerged as a powerful tool to reduce training and inference costs of DNNs for supervised learning. There are mainly three genres of pruning methods, namely pruning-at-initialization, pruning-during-training, and post-hoc pruning methods. Post-hoc pruning (Janowsky, 1989; LeCun et al., 1989; Han et al., 2015) can date back to the 1980s, which was first introduced for reducing inference time and memory requirements; hence does not align with our purpose of efficient training. Later, pruning-at-initialization (Lee et al., 2018; Wang et al., 2020a; Tanaka et al., 2020) and pruning-during-training methods (Louizos et al., 2017; Wen et al., 2016) were introduced to prune the networks before training and throughout the training, respectively. Most early pruning-during-training algorithms (Savarese et al., 2020) gradually decrease the density of the neural networks and hence do not bring much training efficiency compared to posthoc pruning. 
However, recent advances in dynamic sparse training (DST) (Evci et al., 2020; Liu et al., 2021a; b; c; Mocanu et al., 2018) for the first time show that pruning-during-training methods can have comparable training FLOPs as pruning-at-initialization methods while having competing performance with respect to post-hoc pruning. Therefore, applying DST on GANs seems to be a promising choice. Although DST has attained remarkable achievements in supervised learning, the application of DST on GANs is less explored due to newly emerged challenges. The main difficulty stems from the fact that the training procedure of GANs is notoriously brittle. To ensure successful training, we usually need carefully chosen architectures and finely-tuned hyper-parameters. One possible cause is the difficulty of balancing the generator and the discriminator throughout training (Bai et al., 2018; Arora et al., 2017) . Specifically, an overly-strong discriminator will lead to overfitting, while a weak discriminator will result in mode collapse. As a consequence, the requirement of balanced training brings even more challenges to sparse GAN training. On the one hand, we find that performance degradation caused by the unbalance of GANs is even more severe when sparsity is introduced. On the other hand, for directly applying DST to the generator (or both) like the pioneering work STU-GAN (Liu et al., 2022) , it is unclear how to determine a reasonable density of the discriminator. To this end, we propose a metric called balance ratio (BR), which measures the degree of balance of the two components, to study sparse GAN training. We find that BR is useful in (1) understanding the interaction between the discriminator and the generator, (2) identifying the cause of training failure, and (3) helping stabilize sparse GAN training as an indicator. To our best knowledge, this is the first study to investigate the balance of sparse GANs and may even provide new insights into dense GAN training. 
Using BR as an indicator, we further propose double dynamic sparse training (DDST) to adjust the density and the connections of the discriminator automatically during training. Our contributions are summarized below:
• We introduce a quantity named balance ratio to quantify the degree of balance in GAN training, which also helps understand the cause of some training failure cases.
• We first consider single dynamic sparse training (SDST), which is a generalization of STU-GAN (Liu et al., 2022): applying DST to only the generator with varying discriminator density ratios. We show that SDST does not necessarily outperform the static sparse training baselines.
• We provide two strategies to determine the discriminator density for SDST, and we find that using a relatively larger density usually generates stable and better performance.
• Using the balance ratio as an indicator, we propose double dynamic sparse training (DDST), which makes the discriminator dynamic both in density level and parameter level. Empirically, DDST outperforms baselines with reasonable computational cost on several datasets.

2.1. NEURAL NETWORK PRUNING

Based on the smallest granularity of pruned units, neural network pruning can be categorized into structured (Liu et al., 2017; 2018; Huang & Wang, 2018; Luo et al., 2017) and unstructured pruning (Frankle & Carbin, 2018; Han et al., 2015) . In this work, we mainly focus on unstructured pruning where individual weight is the finest resolution. Post-hoc pruning. Post-hoc pruning prunes weights of a fully-trained neural network, and they usually have high computation cost due to the multiple rounds of train-prune-retrain procedure (Han et al., 2015; Renda et al., 2020) . Some use specific criteria (Han et al., 2015; LeCun et al., 1989; Hassibi et al., 1993; Molchanov et al., 2019; Dai et al., 2018; Guo et al., 2016; Dong et al., 2017; Yu et al., 2018) to remove weights, while others perform extra optimization iterations (Verma & Pesquet, 2021) . Post-hoc pruning was initially proposed to reduce the inference time, while later work on lottery ticket works (Frankle & Carbin, 2018; Renda et al., 2020) aimed to mine trainable sub-networks. Pruning-at-initialization methods. SNIP (Lee et al., 2018) is one of the pioneering works which aim to find trainable sub-networks without any training. Some following works (Wang et al., 2020a; Tanaka et al., 2020; de Jorge et al., 2020; Alizadeh et al., 2022) aim to propose different metrics to prune networks at initialization. Among them, Synflow (Tanaka et al., 2020) , SPP (Lee et al., 2019), and FORCE (de Jorge et al., 2020) try to address the problem of layer collapse during pruning. Neural tangent transfer (Liu & Zenke, 2020 ) learns a sub-network by aligning the empirical neural tangent kernel and network output to the dense counterpart. Pruning-during-training methods. Another genre of pruning algorithms gradually prunes dense DNNs throughout training. To mitigate performance drop after pruning, early works add explicit ℓ 0 (Louizos et al., 2017) or ℓ 1 (Wen et al., 2016) regularization terms to encourage sparse solution. 
Later works learn the subnetworks structures through projected gradient descent (Zhou et al., 2021) or trainable masks (Kang & Han, 2020; Kusupati et al., 2020; Liu et al., 2020; Savarese et al., 2020; Srinivas et al., 2017; Xiao et al., 2019) . However, these pruning-during-training methods often do not enjoy memory sparsity during training. As a remedy, DST methods (Bellec et al., 2017; Dettmers & Zettlemoyer, 2019; Evci et al., 2020; Liu et al., 2021a; b; c; Mocanu et al., 2018; Mostafa & Wang, 2019; Graesser et al., 2022) were introduced to train the neural networks under a given parameter budget while mask change is allowed during training.

2.2. GENERATIVE ADVERSARIAL NETWORKS

Generative adversarial networks (GANs). GANs (Goodfellow et al., 2020) have drawn considerable attention and have been widely investigated for years. Various architectures have been proposed to enhance the capability of GANs. Deep convolutional GANs (Radford et al., 2015) replace fullyconnected layers in the generator and the discriminator. After that, follow-up works (Brock et al., 2018; Gulrajani et al., 2017; Karras et al., 2017; Zhang et al., 2019) employed more advanced methods to improve the fidelity of generated samples. Due to the difficulty of finding Nash Equilibrium, training of GAN is highly unstable. Therefore, several novel loss functions (Mao et al., 2017; Arjovsky et al., 2017; Salimans et al., 2016; Gulrajani et al., 2017; Sun et al., 2020) , normalization and regularization methods (Miyato et al., 2018; Wu et al., 2021; Terjék, 2019) were proposed to stabilize the adversarial training. Besides the efforts devoted to the training of GAN, image-to-image translation is also extensively explored. Specifically, this direction includes semantic image synthesis (Zhu et al., 2017b) , style transfer (Karras et al., 2020b; Choi et al., 2018; Zhu et al., 2017a) , super resolution (Ledig et al., 2017; Wang et al., 2018) etc. GAN compression and pruning. Like other deep neural networks, the training and inference process of GANs requires massive resource consumption and memory. One of the promising ways is based on neural architecture search and distillation algorithm (Li et al., 2020; Fu et al., 2020; Hou et al., 2021) . Another part of the work applied prune-based methods for GANs' generator compression (Shu et al., 2019; Jin et al., 2021; Yu & Pool, 2020 ). Yet, they only focus on the pruning of generators, thus potentially posing a negative influence on Nash Equilibrium between generators and discriminators. Later, works by (Wang et al., 2020b) presented a unified framework by combing the methods mentioned above. Follow-up work by Li et al. 
(2021) compresses both components of GANs by letting the student GANs also learn the losses. Another line of work (Kalibhat et al., 2021; Chen et al., 2021) tries to test the existence of lottery tickets in GAN. However, most mentioned methods are not prepared for training efficiency and require over-parameterized GAN models in advance. Directly training sparse GANs has been less explored so far. To the best of our knowledge, STU-GAN (Liu et al., 2022) is the only work that tries to apply DST to GANs.

3. BALANCE RATIO: QUANTIFYING THE BALANCE OF SPARSE GANS

3.1. PRELIMINARY AND SETUPS

Generative adversarial networks (GANs) have two fundamental components, a generator $G(\cdot; \theta_G)$ and a discriminator $D(\cdot; \theta_D)$. Specifically, the generator maps a noise vector $z$ sampled from a multivariate normal distribution $p(z)$ into a fake image to fool the discriminator, whereas the discriminator distinguishes the generator's outputs from the real images $x_r$ drawn from the distribution $q(x)$. Formally, the optimization objective of the two-player game in JS-GAN (Goodfellow et al., 2020) is defined as follows:

$$L(\theta_D, \theta_G) = \mathbb{E}_{x_r \sim q(x)}\left[\log D(x_r; \theta_D)\right] + \mathbb{E}_{z \sim p(z)}\left[\log\left(1 - D(G(z; \theta_G))\right)\right]. \tag{1}$$

Other losses can also be used, including the Wasserstein loss (Gulrajani et al., 2017) and the hinge loss (Miyato et al., 2018). In this work, we use the hinge loss for all GANs.

GAN sparse training. In this work, we are interested in sparse training for GANs. Specifically, the objective of sparse GAN training can be formulated as:

$$\min_{\theta_G} \max_{\theta_D} L(\theta_D \odot m_D, \theta_G \odot m_G) \tag{2}$$

$$\text{s.t.} \quad m_D \in \{0, 1\}^{p_D}, \quad m_G \in \{0, 1\}^{p_G}, \quad \|m_D\|_0 / p_D \le d_D, \quad \|m_G\|_0 / p_G \le d_G,$$

where $\odot$ is the Hadamard product; $\theta_D$, $m_D$, $p_D$, $d_D$ are the sparse solution, mask, number of parameters, and target density for the discriminator, respectively. The corresponding variables for the generator are denoted with subscript $G$. For pruning-at-initialization methods, the masks $m$ are determined before training, whereas for dynamic sparse training (DST) methods they are dynamically adjusted during training.
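The constraint set of the sparse objective can be illustrated with a minimal sketch. It assumes flat Python lists in place of real network tensors, and the helper names (`random_mask`, `apply_mask`, `density`, `hinge_d_loss`, `hinge_g_loss`) are hypothetical and not taken from the paper's code. The random mask is only a stand-in for the ERK initialization used in the experiments; the hinge losses follow the standard form used with SNGAN.

```python
import random

def random_mask(num_params, target_density, seed=0):
    """Return a 0/1 mask with int(target_density * num_params) active entries.

    Assumption: uniform random placement, used here only for illustration.
    """
    rng = random.Random(seed)
    num_active = int(target_density * num_params)
    active = set(rng.sample(range(num_params), num_active))
    return [1 if i in active else 0 for i in range(num_params)]

def apply_mask(theta, mask):
    """Hadamard product of a flat parameter list and its mask."""
    return [t * m for t, m in zip(theta, mask)]

def density(mask):
    """Fraction of active connections, i.e. ||m||_0 / p."""
    return sum(mask) / len(mask)

def hinge_d_loss(d_real, d_fake):
    """Standard hinge loss for the discriminator on batches of raw outputs."""
    real_term = sum(max(0.0, 1.0 - v) for v in d_real) / len(d_real)
    fake_term = sum(max(0.0, 1.0 + v) for v in d_fake) / len(d_fake)
    return real_term + fake_term

def hinge_g_loss(d_fake):
    """Standard hinge loss for the generator: negative mean discriminator output."""
    return -sum(d_fake) / len(d_fake)
```

For example, `density(random_mask(1000, 0.1))` stays within the target density of 10%, and every entry of `apply_mask(theta, mask)` at a pruned position is zero.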

3.2. BALANCE OF GAN DURING TRAINING

As discussed in Section 1, it is essential to maintain the balance between the generator and the discriminator during GAN training. As pointed out by Bai et al. (2018) and Arora et al. (2017), discriminators that are too strong lead to over-fitting, whereas weak discriminators are unable to detect mode collapse. In sparse GAN training, the consequences of this unbalance can be further amplified. Specifically, unlike in dense GAN training, the densities of the generator and the discriminator can differ significantly, leading to a more unbalanced worst-case scenario. To support our claim, we conduct experiments with SNGAN (Miyato et al., 2018) on the CIFAR-10 dataset. Following Liu et al. (2022), we start with static sparse training, where the densities of the generator and the discriminator are chosen from {10%, 20%, 30%, 50%, 100%}. The layer-wise sparsity ratios and the masks $m_G$, $m_D$ are determined using the Erdős-Rényi-Kernel (ERK) graph topology (Evci et al., 2020) and are fixed throughout training. More experiment details can be found in Appendix A.

Experiment results. Results are reported in Figure 2. The first three plots in Figure 2 show the results when varying the discriminator density $d_D$ for weak generators ($d_G \in \{10\%, 20\%, 30\%\}$). We observe that FID first decreases and then increases: neither overly-weak nor overly-strong discriminators provide satisfactory performance. Similarly, for stronger generators ($d_G \in \{50\%, 100\%\}$), only the dense discriminator with $d_D = 100\%$ is strong enough to yield a satisfactory FID. Hence, to ensure balanced GAN training, it is crucial to find a suitable sparsity ratio for the discriminator.

3.3. BALANCE RATIO

Figure 1: Illustration of balance ratio. (Plot of $D(x)$ marking the points $(x_r, D(x_r))$, $(G_{pre}(z), D(G_{pre}(z)))$, and $(G_{post}(z), D(G_{post}(z)))$; balance ratio $= \alpha / \beta$.)

The observation in Section 3.2 raises a fundamental question: is there a way to quantify the degree of balance between the generator and the discriminator? To answer the question, we introduce balance ratio (BR), which is, to the best of our knowledge, the first quantity that measures the balance of sparse generators and discriminators. At each training iteration, we draw random noise $z$ from a multivariate normal distribution and real images $x_r$ from the training set. We denote the discriminator after the gradient descent update as $D(\cdot; \theta_D)$. We denote the generator before and after the gradient descent update as $G_{pre}(\cdot; \theta_G)$ and $G_{post}(\cdot; \theta'_G)$, respectively. Then the balance ratio is defined as:

$$\mathrm{BR} = \frac{D(G_{post}(z)) - D(G_{pre}(z))}{D(x_r) - D(G_{pre}(z))} = \frac{\alpha}{\beta}.$$

Figure 1 also provides an illustration of BR. Specifically, BR measures how much improvement the generator can achieve, on the scale measured by the discriminator, for a specific random noise $z$. When BR is small (e.g., BR < 30%), the updated generator is too weak to trick the discriminator, as the generated images are still considered fake. Similarly, when BR is large (e.g., BR > 80%), the discriminator is considered too weak and hence will not provide useful information to the generator.

Figure 3: Balance ratio of static sparsely trained SNGAN on CIFAR-10 with different sparsity ratio combinations. (Panels by generator density $d_G$; curves for $d_D \in \{10\%, 20\%, 30\%, 50\%, 100\%\}$; vertical axis: balance ratio.)

We now visualize the BR evolution throughout the training for the experiments in Section 3.2. The results are shown in Figure 3.

The effectiveness of BR. We first check if BR can reflect the density increase (hence representation power increase) of the discriminator. In Figure 3

Overly weak discriminators lead to training failure.
For the cases where the discriminators are too weak compared to the generators, e.g., all cases where $d_D = 10\%$, we observe strongly oscillatory behavior of BR. More precisely, BR starts to oscillate after it reaches a value higher than 1.0. During the experiments, we also empirically observe that the FID gradually increases after such a turning point. As also shown in Figure 2, the FID of overly-strong discriminators (e.g., $d_D = 100\%$) is lower than that of overly-weak discriminators (e.g., $d_D = 10\%$). This phenomenon seems to imply that the performance degradation caused by overly-strong discriminators is less harmful than the failure caused by overly-weak discriminators.
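The definition of BR can be made concrete with a short sketch. It assumes the discriminator outputs for one mini-batch are already available as plain Python lists, and that averaging over the batch is an acceptable way to reduce the per-sample quantity to a scalar; both are assumptions, as are the function names. The 30% and 80% thresholds are the illustrative values quoted above, not tuned constants.

```python
def balance_ratio(d_real, d_fake_pre, d_fake_post, eps=1e-8):
    """Compute BR = alpha / beta from discriminator outputs.

    d_real:      D(x_r) for a batch of real images
    d_fake_pre:  D(G_pre(z)), fakes from the generator before its update
    d_fake_post: D(G_post(z)), fakes from the generator after its update
    All three use the same (already updated) discriminator and the same z.
    """
    mean = lambda values: sum(values) / len(values)
    alpha = mean(d_fake_post) - mean(d_fake_pre)  # generator improvement
    beta = mean(d_real) - mean(d_fake_pre)        # real-vs-fake gap
    if abs(beta) < eps:
        return float("nan")  # gap vanished: BR is undefined for this batch
    return alpha / beta

def describe_balance(br, low=0.3, high=0.8):
    """Map a BR value to the qualitative regimes described in the text."""
    if br < low:
        return "generator too weak"
    if br > high:
        return "discriminator too weak"
    return "balanced"
```

With `d_real=[1.0]`, `d_fake_pre=[-1.0]`, and `d_fake_post=[0.0]`, the generator closes half of the gap, so `balance_ratio` returns 0.5.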

3.4. DYNAMIC DENSITY ADJUSTMENT OF THE DISCRIMINATORS

We have shown in Section 3.3 that BR is able to capture the degree of balance between the generator and the discriminator. Hence, it is natural to leverage BR to dynamically adjust the density of the discriminator during sparse GAN training. Specifically, we initialize the density of the discriminator as $d_D^{init} = d_G$. After every $\Delta T_D$ training iterations, we adjust the density of the discriminator based on the time-averaged BR over the last few iterations, using a pre-defined density increment $\Delta d$. With pre-defined BR bounds $[B_-, B_+]$, we decrease $d_D$ by $\Delta d$ when BR is smaller than $B_-$, and increase $d_D$ by $\Delta d$ when BR is larger than $B_+$. Notice that the DA algorithm is in spirit very similar to StyleGAN2-ADA (Karras et al., 2020a), which adaptively adjusts the augmentation probability. For simplicity, we increase the density by growing the connections with the largest gradient magnitude (Evci et al., 2020). Global magnitude pruning is used to drop connections so as to decrease the density. The algorithm is shown in Appendix C, Algorithm 1. We test our proposed dynamic density adjust (DA) method with two target BR intervals, namely DA-strong ($[B_-, B_+] = [0.3, 0.4]$) and DA-mild ($[B_-, B_+] = [0.45, 0.55]$). DA-strong tends to find a relatively stronger discriminator, which results in a lower BR throughout the training, whereas DA-mild tends to make the discriminator and the generator relatively balanced, i.e., BR ≈ 0.5.
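The density update rule of DA can be sketched as a small controller. This is an illustrative outline rather than the authors' Algorithm 1: the class name `DensityAdjuster`, the sliding window used for the time-averaged BR, and the lower clamp `d_min` are assumptions introduced here. The sketch only tracks the target density; actually growing or pruning connections to reach it is left out.

```python
from collections import deque

class DensityAdjuster:
    """Track the time-averaged BR and move the discriminator density by delta_d."""

    def __init__(self, d_init, delta_d, br_low, br_high, window,
                 d_min=0.05, d_max=1.0):
        self.density = d_init      # d_D, initialized to d_G in the text
        self.delta_d = delta_d     # density increment
        self.br_low = br_low       # B_-
        self.br_high = br_high     # B_+
        self.d_min = d_min         # hypothetical floor, not from the paper
        self.d_max = d_max         # 100% for R-DDST, below 100% for S-DDST
        self.history = deque(maxlen=window)

    def record(self, br):
        """Store the BR of the current iteration."""
        self.history.append(br)

    def step(self):
        """Call every Delta T_D iterations; returns the new target density."""
        if not self.history:
            return self.density
        avg_br = sum(self.history) / len(self.history)
        if avg_br < self.br_low:
            # Generator cannot keep up: weaken the discriminator.
            self.density = max(self.d_min, self.density - self.delta_d)
        elif avg_br > self.br_high:
            # Discriminator is too weak: strengthen it.
            self.density = min(self.d_max, self.density + self.delta_d)
        return self.density
```

Under the DA-mild bounds, a window of BR values around 0.9 raises the density by one increment, values around 0.1 lower it, and values inside [0.45, 0.55] leave it unchanged.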

4. IS ONLY ADJUSTING THE GENERATOR ENOUGH FOR SPARSE GANS?

In this section, we test DST on GANs. We first test SDST, a direct application of DST to GANs in which only the generator dynamically adjusts its masks during training. We do not consider naively applying DST to both the generator and the discriminator, because STU-GAN (Liu et al., 2022) empirically shows that adjusting both components simultaneously yields worse performance and more severe training instability. We name this method single dynamic sparse training (SDST) because only one component of the GAN, i.e., the generator, is dynamic. Hence, STU-GAN, which grows connections based on gradients, is a special case of SDST. We follow the same setting as in Section 3.2, where the densities of the generators $d_G$ and discriminators $d_D$ are chosen from {10%, 20%, 30%, 50%, 100%}. The detailed DST procedure and the corresponding hyper-parameters can be found in Appendix B.

Experiment results. We show the experiment results in Figure 4. The first observation is that the performance of RigL and SET does not differ much in general. The second observation is that SDST is better than static sparse training when the discriminator is strong enough. More specifically, for $d_G \in \{10\%, 20\%\}$, SDST is worse than static sparse training when the discriminator is weak ($d_D = 10\%$). On the contrary, when the discriminator is strong enough, $d_D \in \{20\%, 30\%, 50\%, 100\%\}$, we see a large performance boost from SDST. The reason is that the in-time over-parameterization induced by DST increases the representation power of the generator. Such a boost is beneficial only when the discriminator has matching or better representation power.

Despite the superior performance of STU-GAN (or SDST in general) at higher density ratios, SDST has some limitations, which are summarized as follows: ➊ When using SDST, $d_D$ is manually chosen before training. However, it is unclear what is a good choice.
In real-world scenarios, it is not practical to search for the optimal d D for each d G . ➋ The issue of GAN unbalance is unresolved during training since the density of the discriminator is fixed. As shown in Figure 4 , the best performance is not always obtained with the maximum d D = 100%. If we are using an overly-strong discriminator, we are wasting extra computational cost for a worse performance. Hence, STU-GAN (or SDST in general), which directly applies DST to the generator, may only be useful when the corresponding discriminator is strong enough. In this sense, to deal with more complicated scenarios, obtaining balanced training in an automatic way is essential in GAN dynamic sparse training. 

5. DOUBLE DYNAMIC SPARSE TRAINING FOR GANS

STU-GAN (or SDST in general) considered in the last section cannot generate stable and satisfying performance. This implies that we should utilize the discriminator in a better way rather than just directly applying DST to the discriminator. Consequently, DA (Section 3.4), which adjusts the discriminator density to stabilize GAN training, is a favorable candidate to address the issue. We name the proposed method double dynamic sparse training (DDST), which adjusts the density of the discriminator during training with BR as the indicator while the generator performs DST. We propose two DDST methods, namely R-DDST and S-DDST based on whether we give constraints on the maximum density of the discriminator. We present them in Section 5.1 and Section 5.2. We use the word double for the following two reasons: ➊ both the generators and the discriminators are dynamic (both R-DDST and S-DDST); ➋ the discriminator enjoys two levels of dynamic flexibility, namely density level and parameter level (S-DDST). Such a method has much more flexibility and generates more stable performance compared to SDST.

5.1. RELAXED DOUBLE DYNAMIC SPARSE TRAINING

We first investigate the direct combination of SDST with DA. Specifically, the generator is adjusted using SDST as described in Section 4, while the density of the discriminator is dynamically adjusted with DA as described in Section 3.4. We call this combination relaxed double dynamic sparse training (R-DDST) because it does not necessarily introduce sparsity to the discriminator: the density of the discriminator can be as high as 100% (i.e., a dense discriminator).

Datasets, architectures, and target sparsity ratios. We first conduct experiments on SNGAN with ResNet architectures on the CIFAR-10 (Krizhevsky et al., 2009) and STL-10 (Coates et al., 2011) datasets. Target density ratios of the generators $d_G$ are chosen from {10%, 20%, 30%, 50%}. Please see Appendix A for more experiment details.

Baseline methods and R-DDST. We use static sparse training (Section 3.2) and SDST (Section 4) as our baselines. Since these baselines use pre-defined discriminator density ratios, we propose two strategies to define the discriminator density ratios based on the results from Section 3.2 and Section 4: ➊ balance strategy, where we set the density of the discriminator $d_D$ the same as the density of the generator $d_G$; ➋ strong strategy, where we set the density of the discriminator as large as possible, i.e., $d_D = d_{max}$. In this section, we use $d_{max} = 100\%$ for the strong strategy. For SDST methods, we test both grow methods, i.e., SDST(SET) (Mocanu et al., 2018), which grows connections randomly, and SDST(RigL) (Evci et al., 2020), which grows connections via gradients. Similar to SDST, we again consider R-DDST(SET) and R-DDST(RigL), which differ in how R-DDST grows connections. Note that, for simplicity, we use the same growth criterion for the generator and the discriminator. More training details can be found in Appendix B. FID results on the training set are shown in Table 1. More results of SNGAN on the CIFAR-10 test set can be found in Appendix E.
Training costs comparison can be found in Appendix G. The strong strategy and the balance strategy. For almost all density ratios of SNGAN (CIFAR-10) experiments, using the strong strategy is always comparable to or better than the balance strategy. The difference between the two is almost negligible when applied to static methods. However, for SDST methods, using stronger discriminators always leads to a large performance gain. For SNGAN on the STL-10 dataset, the advantage of the strong strategy over the balance strategy is no longer obvious. Precisely, for 3 out of 8 cases, using the strong strategy is better than using the balance strategy. The explanation is that the size difference between generators and discriminators is larger for STL-10. Hence, the degree of unbalance is more severe and leads to more detrimental effects. R-DDST identifies reasonable discriminator density. For the CIFAR-10 dataset, we find that R-DDST consistently performs better than the corresponding baselines with the same grow methods. This illustrates that R-DDST is flexible and able to find suitable discriminator density compared to the two baseline strategies, i.e., the strong and the balance strategy. For the STL-10 dataset, R-DDST(RigL) performs consistently better than R-DDST(SET) and baselines, whereas R-DDST(SET) is not competitive any more. We postulate that under such a setting where the dataset scales up and the training is more difficult, gradient growth not only identifies important connections of the generator but also provides efficient representation power growth of the discriminator to balance the growth of the generator. Please also see Appendix D for the time evolution of BR and the discriminator density during training for R-DDST methods. Larger GAN model experiments. We have also conducted experiments with BigGAN (Brock et al., 2018) on CIFAR-10 datasets. Based on the SNGAN results, we compare all RigL variants with static baselines. 
FID and normalized training FLOPs with respect to dense training are shown in Table 2 . The results show that R-DDST shows stable performance and outperforms other baselines most of the time. Moreover, compared to the second best method SDST-Strong, R-DDST not only shows lower FID but also requires much less training cost. Main takeaway. In this section, we compared R-DDST with sparse training baselines. We find that RigL and strong strategy are preferred compared to SET and balance strategy. SDST(RigL) with strong strategy generally generates better performance compared to other sparse training baselines. Finally, R-DDST(RigL) beats SDST(RigL) with much less computational cost and always ranked top two among all methods.

5.2. STRICT DOUBLE DYNAMIC SPARSE TRAINING

R-DDST, introduced in the previous section, does not necessarily introduce sparsity for the discriminator, which yields smaller memory and training savings at larger generator density ratios. Hence, we further present strict double dynamic sparse training (S-DDST) in this section, which enforces the discriminator to be sparse, i.e., $d_D \le d_{max} < 100\%$. In this section, we assume that

Baseline methods and S-DDST. We use the same baselines and adopt the same general setup as in Section 5.1. We divide the training iterations evenly into two phases. For a comprehensive comparison, we continue to report FID results for the two growth methods, i.e., S-DDST(SET) and S-DDST(RigL), in Table 3. IS and other results can be found in Appendix E.

S-DDST shows stable and superior performance. For the CIFAR-10 dataset, we notice that S-DDST stably surpasses its corresponding baselines regardless of the grow method and the initial densities of the discriminator and the generator. Even with a further constraint on the discriminator, DA is still able to improve GAN training and can find a more reasonable density than the strong and the balance strategies. For the STL-10 dataset, S-DDST(RigL) again shows the most promising performance. Please also see Figure 7 in Appendix D for the discriminator density and BR evolution during training.

Main takeaway. In this section, we report the results of S-DDST(RigL) and its baselines. Generally, RigL still demonstrates encouraging results compared with SET in most experiments when extra sparsity is introduced. While the strong strategy shows favorable performance on the CIFAR-10 dataset, the gain is not salient when the size of the backbone increases and the training dataset scales up to STL-10. Most importantly, S-DDST(RigL) achieves performance comparable to R-DDST(RigL) in some cases and outperforms SDST(RigL) after we restrict the maximal density of the discriminator.

6. CONCLUSION

In this paper, we study DST for GANs. We find that simply applying DST methods to the generator is not sufficient to improve the performance of sparse GANs. Hence, we propose to use BR to measure the degree of unbalance between the generator and the discriminator. We find that the application of DST only on the generator is beneficial when the discriminator is relatively stronger. Furthermore, we propose two methods, namely R-DDST and S-DDST, to dynamically adjust the discriminator in both parameter and density levels. Both of these methods demonstrate encouraging results. Our study may help researchers have a better understanding of the balance of GAN training and encourage more researchers to investigate sparse training for generative models.

7. REPRODUCIBILITY STATEMENT

To ensure reproducibility, we will include a link to an anonymous repository after the discussion forums are open. All the experiment details can be found in Section 4, Section 5.1, Section 5.2, Appendix A and Appendix B.

of layer $l$ while $w_l$ and $h_l$ are the width and the height of the corresponding kernel in that layer. For fully connected layers, the Erdős-Rényi (ER) strategy is used, where the sparsity is scaled with $1 - \frac{n_{l-1} + n_l}{n_{l-1} n_l}$.

Drop and grow. After every $\Delta T$ training iterations, we update the mask $m_G$ by dropping (pruning) the $f_{decay}(\gamma, T)\, p_G d_G$ connections with the lowest magnitude, where $p_G$ and $d_G$ are the number of parameters and the target density of the generator, and $f_{decay}(\gamma, T)$ is the update schedule, which is explained in the next part. Right after the connection drop, we regrow the same number of connections. For the growing criterion, we test both random growth, SDST(SET) (Mocanu et al., 2018; Liu et al., 2021c), and gradient-based growth, SDST(RigL) (Evci et al., 2020). Concretely, gradient-based methods activate the connections $\theta$ with the highest gradient magnitude $\left|\frac{\partial L}{\partial \theta}\right|$, while random-based methods explore connections in a random fashion. All the newly-activated connections are initialized to 0. Note that while previous works drop and grow connections layer-wise, we drop and grow connections globally, as this grants more flexibility to the SDST method.

Update schedule. The update schedule is specified by the number of training iterations between sparse connectivity updates $\Delta T$, the initial fraction of connections adjusted $\gamma$, and the decay schedule $f_{decay}(\gamma, T)$ for $\gamma$.

EMA for sparse GAN. EMA (Yaz et al., 2018) is well-known for its ability to alleviate the non-convergence of GANs. We also implement EMA for sparse GAN training. Specifically, we zero out the moving average of dropped weights whenever there is a mask change.
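The drop-and-grow step with gradient-based growth can be sketched as follows. This is a simplified illustration on flat Python lists, not the authors' implementation. The cosine form of `cosine_decay` is the schedule commonly used by RigL and is assumed here. Restricting growth candidates to connections that were inactive before the drop is a further simplifying assumption, and the function names are hypothetical.

```python
import math

def cosine_decay(gamma, t, t_end):
    """Fraction of active connections to update at iteration t (assumed cosine form)."""
    return gamma / 2.0 * (1.0 + math.cos(math.pi * t / t_end))

def drop_and_grow(theta, mask, grads, num_update):
    """One global mask update: magnitude-based drop, then gradient-based growth.

    theta, mask, grads are flat lists over all layers (global, not layer-wise).
    Returns new (theta, mask); the number of active connections is preserved
    as long as enough inactive candidates exist.
    """
    new_theta, new_mask = list(theta), list(mask)
    active = [i for i, m in enumerate(mask) if m == 1]
    inactive = [i for i, m in enumerate(mask) if m == 0]

    # Drop: remove the active connections with the smallest |theta|.
    dropped = sorted(active, key=lambda i: abs(theta[i]))[:num_update]
    for i in dropped:
        new_mask[i] = 0
        new_theta[i] = 0.0

    # Grow: activate the inactive connections with the largest |gradient|.
    grown = sorted(inactive, key=lambda i: abs(grads[i]), reverse=True)[:len(dropped)]
    for i in grown:
        new_mask[i] = 1
        new_theta[i] = 0.0  # newly activated connections start at zero

    return new_theta, new_mask
```

In a training loop, `num_update` would be `int(cosine_decay(gamma, t, t_end) * p_G * d_G)`, evaluated every ΔT iterations. Replacing the gradient ranking with a random choice among inactive connections would give the SET-style variant.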

B.2 DST HYPERPARAMETERS FOR SDST

SNGAN on the CIFAR-10 and the STL-10 datasets. The connection update frequency of the generator $\Delta T$ is set to 500 and 1000 for the CIFAR-10 dataset and the STL-10 dataset, respectively. The initial $\gamma$ is set to 0.5 and we use a cosine annealing function $f_{decay}$ following RigL and ITOP.

BigGAN on the CIFAR-10 dataset. The connection update frequency of the generator $\Delta T$ is set to 1000. The initial $\gamma$ is set to 0.5 and we use a cosine annealing function $f_{decay}$ following RigL and ITOP.
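As a hedged illustration of the cosine annealing schedule, the sketch below uses the form given in RigL (Evci et al., 2020), $f_{decay}(\gamma, t) = \frac{\gamma}{2}\left(1 + \cos\frac{\pi t}{T_{end}}\right)$. The argument `t_end` (the iteration at which the fraction reaches zero) and both function names are assumptions of this example.

```python
import math


def f_decay(gamma, t, t_end):
    """Cosine-annealed update fraction: gamma at t = 0, decaying to 0 at t = t_end."""
    t = min(max(t, 0), t_end)  # clamp so the fraction stays in [0, gamma]
    return 0.5 * gamma * (1.0 + math.cos(math.pi * t / t_end))


def num_connections_to_update(gamma, t, t_end, num_params, density):
    """Connections dropped and regrown at iteration t: f_decay * p * d."""
    return int(f_decay(gamma, t, t_end) * num_params * density)
```

With the initial $\gamma = 0.5$ used here, half of the active connections are eligible for replacement at the first update, and the fraction shrinks smoothly as training proceeds.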

B.3 DYNAMIC ADJUST AND DST HYPERPARAMETERS FOR DDST

R-DDST. For R-DDST, only the generator is adjusted using DST, while the discriminator is adjusted using dynamic adjust (DA). The DA bounds are chosen to be [0.475, 0.525], [0.45, 0.55], and [0.45, 0.55] for SNGAN (CIFAR-10), SNGAN (STL-10), and BigGAN (CIFAR-10), respectively. $\Delta d$ is set to 0.05, 0.025, and 0.05 for SNGAN (CIFAR-10), SNGAN (STL-10), and BigGAN (CIFAR-10), respectively. The density of the discriminator is adjusted every 1000, 2000, and 5000 iterations for the three settings, respectively. The BR time-averaged over 1000 iterations is used as the indicator. For the generator, we use the same settings as in Section B.2.

S-DDST. For S-DDST, the discriminator is adjusted using DA in the first half of training, i.e., the first 50,000 iterations. In the second half of training, the discriminator is adjusted using DST. The generator is adjusted only with DST. The DA bounds are set to [0.45, 0.55] and [0.475, 0.525] for the CIFAR-10 and STL-10 datasets, respectively. The density of the discriminator is adjusted every 2000 iterations for each dataset. The density of the generator is adjusted every 1000 iterations. We compute BR at every iteration to visualize its evolution; note that this computational cost can be greatly reduced if BR is computed only every $\Delta T$ iterations.

C ALGORITHMS

In this section, we present the detailed algorithms for both DA and S-DDST. We do not present the algorithm of R-DDST as it is a combination of DA and SDST.

        Increase the density of the discriminator from $d_D$ to $d_D + \Delta d$ using the given grow method A.
    else if BR is less than or equal to $B_-$ then
        Decrease the density of the discriminator from $d_D$ to $d_D - \Delta d$ using the given drop method B.

Require: Generator G, discriminator D, total number of iterations T, number of training steps for the discriminator in each iteration N, maximal density of the discriminator $d_{max}$.
for t in [1, ..., T] do
    for n in [1, ..., N] do
        Compute the loss of the discriminator $L_D(\theta_D)$
        $L_D(\theta_D)$.backward()
    end for
    Compute the loss of the generator $L_G(\theta_G)$
    $L_G(\theta_G)$.backward()
    if t is less than 0.5 * T and the current density of the discriminator $d_D$ is less than $d_{max}$ then
        Apply DA in Algorithm 1 to D
    else
        Apply DST to D
    end if
    Apply DST to G
end for
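The density rule of DA and the phase switch of S-DDST can be sketched as two small functions. This is an illustrative reading, not the authors' code: the upper-bound test (`br_avg >= b_upper`) is an assumption inferred from the observation that a denser discriminator yields a lower BR, and the clamping to `[d_min, d_max]` as well as both function names are hypothetical.

```python
def dynamic_adjust(br_avg, d_D, delta_d, b_lower, b_upper, d_min=0.0, d_max=1.0):
    """Return the new discriminator density given the time-averaged BR.

    A BR at or above the upper bound asks for a denser discriminator
    (assumed condition); a BR at or below the lower bound asks for a
    sparser one. Otherwise the density is left unchanged.
    """
    if br_avg >= b_upper:
        d_D = min(d_D + delta_d, d_max)  # grow connections (method A)
    elif br_avg <= b_lower:
        d_D = max(d_D - delta_d, d_min)  # drop connections (method B)
    return d_D


def discriminator_update_rule(t, total_iters, d_D, d_max):
    """Which rule S-DDST applies to D at iteration t: "DA" or "DST"."""
    if t < 0.5 * total_iters and d_D < d_max:
        return "DA"
    return "DST"
```

For example, with bounds [0.45, 0.55] and $\Delta d = 0.05$, a time-averaged BR of 0.6 would move a discriminator at density 0.30 to 0.35, while a BR of 0.5 would leave it unchanged.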

D DDST BALANCE RATIO EVOLUTION

In this section, we show that DDST methods are able to maintain a BR throughout training. We show the time evolution of BR and discriminator density for CIFAR-10 and STL-10 datasets. 

E MORE EXPERIMENT RESULTS

In this section, we present the IS results for Table 1 and Table 3. The corresponding results are shown in Table 6 and Table 7, respectively. We also include FID results on the CIFAR-10 test set in Table 8.

We compare the following methods under two settings where $d_{max} \in \{100\%, 50\%\}$:
• Dense training.
• static-Strong.
• SDST-Strong.
• R-DDST.
• S-DDST.

We choose static-Strong and SDST-Strong as they perform relatively better than their counterparts with the balance strategy. To simplify the calculation, we compute the FLOPs of R-DDST and S-DDST assuming the discriminator density $d_D = d_{max}$. We also assume that DST does not change the FLOPs. The results are shown in Table 9 and Table 10. The extra computational cost introduced by DA, which computes BR, and by RigL, which computes gradient magnitudes for connection growth, is negligible compared to the total training cost, as both happen only once every several hundred iterations.

G A DETAILED COMPARISON OF TRAINING COSTS

In this section, we compute the computational cost of RigL variants and static baselines more accurately. Specifically, we take into account the density redistribution over different layers. We again neglect the computational overhead introduced by computing BR.



Notice that STU-GAN is almost identical to SDST(RigL) with EMA tailored for DST. In our experiment, we compute BR for every iteration to visualize its evolution. However, BR only needs to be calculated for every several hundred iterations to compute the time-averaged BR.



Figure 2: FID (↓) of static sparsely trained SNGAN with and without DA on CIFAR-10 with different sparsity ratio combinations. The shaded areas denote the standard deviation.

we can see that for larger discriminator density d D , the BR is much lower throughout the training. Furthermore, the best density as indicated by Figure 2 has overall BR in the range [0.3, 0.7].

Figure 4: FID (↓) comparison of SDST against static sparse training for SNGAN on CIFAR-10 with different sparsity ratio combinations. The shaded areas denote the standard deviation.

Experiment results. Results are shown in Figure 2 with dashed lines. For stronger generators ($d_G \in \{30\%, 50\%, 100\%\}$), both DA-strong and DA-mild are able to identify reasonable discriminator densities. For weak generators ($d_G \in \{10\%, 20\%\}$), DA-mild shows more stable performance. The experiments show the significant benefits of using BR to control the discriminator density. Furthermore, they again support our claim that neither overly strong nor overly weak discriminators lead to balanced and successful GAN training.

For a fair comparison, baseline methods can use the discriminator with arbitrary sparsity ratio, i.e., $d_D \in [d_{min}, d_{max}] = [0\%, 100\%]$.

Comparison to STU-GAN (SDST). Compared to STU-GAN (or SDST in general), which predefines the discriminator density before training, R-DDST adjusts the density of the discriminator automatically during training through DA. Given the initial discriminator density $d_D^{\mathrm{int}} = d_G$, R-DDST automatically increases the discriminator density if a stronger discriminator is needed, and vice versa.

we can use the discriminator with sparsity ratio $d_D \in [d_{min}, d_{max}] = [0\%, 50\%]$. Compared with R-DDST, the learning process is more challenging because of the constraint on the maximal discriminator density. S-DDST consists of two phases and works as follows:

➊ Density exploration of the discriminator. During the first phase, S-DDST performs just like R-DDST, with the exception that we apply the constraint $d_D \le d_{max} < 100\%$. Concretely, S-DDST aims to find a suitable discriminator density $d_D^*$ with the DA algorithm in the first half of training.

➋ Parameter exploration of the discriminator. During the second phase, both the generator and the discriminator are adjusted using DST, with the discriminator density fixed at the value $d_D^*$ found in the first phase.

DOUBLE DYNAMIC SPARSE TRAINING ALGORITHM

Details of the S-DDST algorithm are presented in Algorithm 2.

Algorithm 2 Strict double dynamic sparse training (S-DDST) for GANs.

Figure 5: Balance ratio and discriminator density evolution during training for R-DDST(RigL) on CIFAR-10. Dashed lines represent BR values of 0.45 and 0.55.

Figure 6: Balance ratio and discriminator density evolution during training for R-DDST(RigL) on STL-10. Dashed lines represent BR values of 0.45 and 0.55.

Figure 7: Balance ratio and discriminator density evolution during training for S-DDST(RigL) on CIFAR-10. Dashed lines represent BR values of 0.45 and 0.55.

Figure 8: Balance ratio and discriminator density evolution during training for S-DDST(RigL) on STL-10. Dashed lines represent BR values of 0.45 and 0.55.

FID (↓) of different sparse training methods on CIFAR-10 and STL-10 datasets with no constraint on the density of the discriminator. Best results are in bold; second-best results are underlined.

FID (↓) and normalized training FLOPs of different sparse training methods with BigGAN on CIFAR-10 dataset. Best results are in bold; second-best results are underlined.

FID (↓) of different sparse training methods on CIFAR-10 and STL-10 datasets. The density of the discriminator is constrained to be lower than 50%. Best results are in bold; second-best results are underlined.

ResNet architecture for CIFAR-10.

ResNet architecture for STL-10.

Generator G, discriminator D, DA upper bound $B_+$ and lower bound $B_-$, DA interval $\Delta T_D$, density increment $\Delta d$, grow method A, drop method B, iteration t.

IS (higher is better) of different sparse training methods on CIFAR-10 and STL-10 datasets. There is no constraint on the density of the discriminator, i.e., d max = 100%.

IS (higher is better) of different sparse training methods on CIFAR-10 and STL-10 datasets. The density of the discriminator is constrained to be lower than d max = 50%.

FID of test set (↓) of different sparse training methods on CIFAR-10 dataset. Best results are in bold; second-best results are underlined.

Training FLOPs ($\times 10^{17}$) of different sparse training methods on CIFAR-10 dataset.

Training FLOPs ($\times 10^{17}$) of different sparse training methods on STL-10 dataset.

F A ROUGH ESTIMATION OF COMPUTATIONAL COSTS ON SNGAN

In this section, we provide a very rough estimate of the computational cost of different sparse training methods in terms of training FLOPs. Please see Appendix G for a more accurate comparison. We approximate the number of backward FLOPs as two times the number of forward FLOPs.
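The backward-equals-twice-forward approximation means one training pass costs roughly three forward passes. A minimal sketch of that arithmetic follows; the linear scaling of forward FLOPs with density and the function name `training_flops` are simplifying assumptions for illustration, and the numbers in the usage note are hypothetical rather than measured.

```python
def training_flops(dense_forward_flops, density=1.0):
    """Rough FLOPs of one forward-plus-backward pass of a sparse network.

    Assumes forward cost scales linearly with density and that the
    backward pass costs twice the forward pass.
    """
    forward = density * dense_forward_flops
    backward = 2.0 * forward
    return forward + backward
```

For a hypothetical network with 1e9 dense forward FLOPs, `training_flops(1e9)` gives 3e9, and at density 0.1 the estimate drops to about 3e8, i.e. the cost relative to dense training equals the density under these assumptions.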

A EXPERIMENTAL SETUP

Our code is mainly based on the original code of ITOP (Liu et al., 2021c) and GAN ticket (Chen et al., 2021) .

A.1 ARCHITECTURE DETAILS

We use ResNet-32 (He et al., 2016) for the CIFAR-10 dataset and ResNet-48 for the STL-10 dataset. See Table 4 and Table 5 for detailed architectures. Spectral normalization is applied to all fully connected layers and convolutional layers of the discriminators. For the BigGAN architecture, we use the implementation of Zhao et al. (2020).²

A.2 DATASETS

We use the training set of CIFAR-10 and the unlabeled partition of STL-10 for GAN training. Training images are resized to 32 × 32 and 48 × 48 for the CIFAR-10 and STL-10 datasets, respectively. Augmentation methods for both datasets are random horizontal flip and per-channel normalization.

A.3 TRAINING HYPERPARAMETERS

SNGAN on the CIFAR-10 and STL-10 datasets. We use a learning rate of $2 \times 10^{-4}$ for both generators and discriminators. The discriminator is updated five times for every generator update. We adopt the Adam optimizer with $\beta_1 = 0$ and $\beta_2 = 0.9$. The batch size of the discriminator and the generator is set to 64 and 128, respectively. Hinge loss is used following Brock et al. (2018); Chen et al. (2021). We use exponential moving average (EMA) (Yaz et al., 2018) with $\beta = 0.999$. The generator is trained for a total of 100k iterations.

BigGAN on the CIFAR-10 dataset. We use a learning rate of $2 \times 10^{-4}$ for both generators and discriminators. The discriminator is updated four times for every generator update. We adopt the Adam optimizer with $\beta_1 = 0$ and $\beta_2 = 0.999$. The batch size of both the discriminator and the generator is set to 50. Hinge loss is used following Brock et al. (2018); Wu et al. (2021). We use EMA with $\beta = 0.9999$. The generator is trained for a total of 200k iterations.
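The EMA of the generator weights, together with the masking applied for sparse training in Appendix B.1, can be sketched as follows. The flat-list representation and the function names `ema_update` and `mask_ema` are assumptions made for illustration only.

```python
def ema_update(ema, weights, beta=0.999):
    """One EMA step: ema <- beta * ema + (1 - beta) * weights."""
    return [beta * e + (1.0 - beta) * w for e, w in zip(ema, weights)]


def mask_ema(ema, mask):
    """Zero the moving average of dropped weights after a mask change."""
    return [e if m else 0.0 for e, m in zip(ema, mask)]
```

Calling `mask_ema` right after each connectivity update keeps the averaged generator as sparse as the trained one; without it, stale averages of pruned weights would persist in the EMA copy.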

A.4 EVALUATION METRIC

SNGAN on the CIFAR-10 and the STL-10 datasets. We compute Fréchet inception distance (FID) and Inception score (IS) for 50k generated images every 2000 iterations. Best FID and IS are reported. For the CIFAR-10 dataset, we report FID for both the training set and the test set, whereas for the STL-10 dataset, we report the FID of the unlabeled partition.

BigGAN on the CIFAR-10 dataset. We compute Fréchet inception distance (FID) and Inception score (IS) for 10k generated images every 5000 iterations. Best FID and IS are reported.

B DYNAMIC SPARSE TRAINING DETAILS

B.1 GENERAL DST HYPERPARAMETERS

Following Evci et al. (2020), we specify the hyper-parameters of DST through sparsity distribution, update schedule, drop criterion, and grow criterion. We explain the details of DST below.

Sparsity distribution at initialization. Following Evci et al. (2020); Liu et al. (2021c), only parameters of fully connected layers and convolutional layers are pruned. At initialization, we use the commonly adopted Erdős-Rényi-Kernel (ERK) strategy (Evci et al., 2020; Dettmers & Zettlemoyer, 2019; Liu et al., 2021c) to allocate higher sparsity to larger layers. Specifically, the sparsity of convolutional layer $l$ is scaled with $1 - \frac{n_{l-1} + n_l + w_l + h_l}{n_{l-1} n_l w_l h_l}$, where $n_l$ denotes the number of channels

² https://github.com/mit-han-lab/data-efficient-gans/tree/master/DiffAugment-biggan-cifar.

In this subsection, we show the results of BigGAN (CIFAR-10). A simplified version is included in Table 2; here we give more detailed results in Table 12. The results are similar to those of SNGAN on the CIFAR-10 dataset.
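Returning to the ERK initialization of Appendix B.1, the per-layer scaling can be turned into concrete densities with a short sketch. It assumes that each layer's density is proportional to $\frac{n_{l-1} + n_l + w_l + h_l}{n_{l-1} n_l w_l h_l}$ (or the ER analogue for fully connected layers), with a shared constant `eps` chosen so that the overall density matches the target and layers that would exceed density 1 kept dense. The normalization procedure and the name `erk_densities` are assumptions of this example.

```python
def erk_densities(layer_shapes, target_density):
    """Per-layer densities under ERK-style scaling (illustrative).

    layer_shapes: tuples (n_in, n_out, w, h) for conv layers or
                  (n_in, n_out) for fully connected layers.
    target_density: desired overall fraction of nonzero parameters.
    """
    def num_params(shape):
        p = 1
        for s in shape:
            p *= s
        return p

    params = [num_params(s) for s in layer_shapes]
    # Raw score: sum of dimensions divided by product of dimensions.
    scores = [sum(s) / p for s, p in zip(layer_shapes, params)]
    dense = [False] * len(layer_shapes)
    eps = 0.0

    while True:
        # Nonzero budget left after accounting for layers kept dense.
        budget = target_density * sum(params) - sum(
            p for p, d in zip(params, dense) if d)
        norm = sum(s * p for s, p, d in zip(scores, params, dense) if not d)
        if norm == 0:
            break
        eps = budget / norm
        overflow = [i for i, s in enumerate(scores)
                    if not dense[i] and eps * s > 1.0]
        if not overflow:
            break
        for i in overflow:
            dense[i] = True  # density cannot exceed 1; keep layer dense

    return [1.0 if d else eps * s for s, d in zip(scores, dense)]
```

For a toy network with one 3 × 3 convolution (64 to 64 channels) and one 128 × 10 fully connected layer at a 10% target, the small fully connected layer is kept dense and the convolution absorbs the remaining budget, which matches the intent of giving larger layers higher sparsity.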

H ONE-SHOT PRUNING AFTER TRAINING WITHOUT FINE-TUNING

In this section, we perform one-shot pruning after training for GANs without any fine-tuning. The results of SNGANs on the CIFAR-10 and STL-10 datasets are shown in Table 13 .

