GANS CAN PLAY LOTTERY TICKETS TOO

Abstract

Deep generative adversarial networks (GANs) have gained growing popularity in numerous scenarios, but they usually suffer from high parameter complexity, which hinders resource-constrained real-world applications. However, the compression of GANs has been less explored. A few works show that heuristically applying compression techniques normally leads to unsatisfactory results, due to the notorious training instability of GANs. In parallel, the lottery ticket hypothesis has shown prevailing success on discriminative models, locating sparse matching subnetworks capable of training in isolation to full model performance. In this work, we for the first time study the existence of such trainable matching subnetworks in deep GANs. For a range of GANs, we indeed find matching subnetworks at 67%-74% sparsity. We observe that whether or not the discriminator is pruned has a minor effect on the existence and quality of matching subnetworks, while the initialization weights used in the discriminator play a significant role. We then show the powerful transferability of these subnetworks to unseen tasks. Furthermore, extensive experimental results demonstrate that our found subnetworks substantially outperform previous state-of-the-art GAN compression approaches in both image generation (e.g., SNGAN) and image-to-image translation (e.g., CycleGAN).

1. INTRODUCTION

Generative adversarial networks (GANs) have been successfully applied to many fields like image translation (Jing et al., 2019; Isola et al., 2017; Liu & Tuzel, 2016; Shrivastava et al., 2017; Zhu et al., 2017) and image generation (Miyato et al., 2018; Radford et al., 2016; Gulrajani et al., 2017; Arjovsky et al., 2017). However, they are often heavily parameterized and require intensive computation at both the training and inference phases. Network compression techniques (LeCun et al., 1990; Wang et al., 2019; 2020b; Li et al., 2020) can help at inference by reducing the number of parameters or the memory footprint; nonetheless, this saving does not come for free. Although these techniques strive to maintain performance after compressing the model, a non-negligible drop in generative capacity is usually observed. A question is raised: is there any way to compress a GAN model while preserving or even improving its performance? The lottery ticket hypothesis (LTH) (Frankle & Carbin, 2019) provides a positive answer with matching subnetworks (Chen et al., 2020b). It states that there exist matching subnetworks in dense models that can be trained to reach test accuracy comparable to the full model within similar training iterations. The hypothesis has shown success in various fields (Yu et al., 2020; Renda et al., 2020; Chen et al., 2020b), and its properties have been widely studied (Malach et al., 2020; Pensia et al., 2020; Elesedy et al., 2020). However, it has never been applied to GANs, and the presence of matching subnetworks in generative adversarial networks therefore remains mysterious. To address this gap in the literature, we investigate the lottery ticket hypothesis in GANs.
One critical challenge of extending LTH to GANs emerges: how to deal with the discriminator while compressing the generator, including (i) whether to prune the discriminator simultaneously and (ii) what initialization should be adopted by the discriminator during re-training. Previous GAN compression methods (Shu et al., 2019; Wang et al., 2019; Li et al., 2020; Wang et al., 2020b) prune the generator only, since they aim at reducing parameters in the inference stage. The effect of pruning the discriminator has never been studied by these works; it is unnecessary for them but possibly essential for finding matching subnetworks. This is because finding matching subnetworks involves re-training the whole GAN, in which an imbalance between generative and discriminative power could degrade the training results. For the same reason, a disequilibrium between the initializations used in generators and discriminators incurs severe training instability and unsatisfactory results. Another attractive property of LTH is the powerful transferability of located matching subnetworks. Although it has been well studied in discriminative models (Mehta, 2019; Morcos et al., 2019; Chen et al., 2020b), an in-depth understanding of transfer learning for GAN tickets is still missing. In this work, we not only show whether the sparse matching subnetworks in GANs can transfer across multiple datasets but also study which initialization benefits transferability the most. To convert the parameter efficiency of LTH into actual computational savings, we also utilize channel pruning (He et al., 2017) to find structured matching subnetworks of GANs, which enjoy the bonus of accelerated training and inference. Our contributions can be summarized in the following four aspects: • Using unstructured magnitude pruning, we identify matching subnetworks at 74% sparsity in SNGAN (Miyato et al., 2018) and 67% in CycleGAN (Zhu et al., 2017).
The matching subnetworks in GANs exist regardless of whether the discriminator is pruned, while the initialization weights used in the discriminator are crucial. • We show that the matching subnetworks found by iterative magnitude pruning outperform subnetworks extracted by random pruning and random initialization in terms of extreme sparsity and performance. To fully exploit the trained discriminator, we use the dense discriminator as a distillation source and further improve the quality of winning tickets. • We demonstrate that the found subnetworks in GANs transfer well across diverse generative tasks. • The matching subnetworks found by channel pruning surpass previous state-of-the-art GAN compression methods (i.e., GAN Slimming (Wang et al., 2020b)) in both efficiency and performance.

2. RELATED WORK

GAN Compression Generative adversarial networks (GANs) have succeeded in computer vision fields such as image generation and translation. One significant drawback of these generative models is the high computational cost of their complex structure. A wide range of neural network compression techniques have been applied to generative models to address this problem. There are several categories of compression techniques, including pruning (removing some parameters), quantization (reducing the bit width), and distillation. Shu et al. (2019) proposed a channel pruning method for CycleGAN using a co-evolution algorithm. Wang et al. (2019) proposed a quantization method for GANs based on the EM algorithm. Li et al. (2020) used a distillation method to transfer knowledge from the dense model to the compressed model. Recently, Wang et al. (2020b) proposed a GAN compression framework, GAN Slimming, that integrates the above three mainstream compression techniques into a unified form. Previous works on GAN pruning usually aim at finding a sparse structure of the trained generator for faster inference, while we focus on finding trainable structures of GANs following the lottery ticket hypothesis. Moreover, existing GAN compression methods prune only the generator, which could undermine re-training performance: the untouched discriminator may have stronger capacity than the pruned generator, causing degraded results due to the disparity between the two models. The Lottery Ticket Hypothesis The lottery ticket hypothesis (LTH) (Frankle & Carbin, 2019) claims the existence of sparse, separately trainable subnetworks in a dense network.
These subnetworks are capable of reaching performance comparable to, or even better than, the full dense model, which has been evidenced in various fields, such as image classification (Frankle & Carbin, 2019; Liu et al., 2019; Wang et al., 2020a; Evci et al., 2019; Frankle et al., 2020; Savarese et al., 2020; Yin et al., 2020; You et al., 2020; Ma et al., 2021; Chen et al., 2020a), natural language processing (Gale et al., 2019; Chen et al., 2020b), reinforcement learning (Yu et al., 2020), lifelong learning (Chen et al., 2021b), graph neural networks (Chen et al., 2021a), and adversarial robustness (Cosentino et al., 2019). Most works on LTH use unstructured weight magnitude pruning (Han et al., 2016; Frankle & Carbin, 2019) to find the matching subnetworks, and channel pruning has also been adopted in a recent work (You et al., 2020). In order to scale up LTH to larger networks and datasets, the "late rewinding" technique was proposed by Frankle et al. (2019); Renda et al. (2020). Mehta (2019); Morcos et al. (2019); Desai et al. (2019) pioneered the study of the transferability of found subnetworks. However, all previous works focus on discriminative models. In this paper, we extend LTH to GANs and reveal unique findings about GAN tickets.

3. PRELIMINARIES

In this section, we describe our pruning algorithms and list related experimental settings.

Backbone Networks We use two GANs in our experiments in Section 4: SNGAN (Miyato et al., 2018) with a ResNet (He et al., 2016) backbone and CycleGAN (Zhu et al., 2017). We use CIFAR-10 (Krizhevsky et al., 2009) as the benchmark. For the transfer study, the experiments are conducted on CIFAR-10 and STL-10 (Coates et al., 2011). For better transferring, we resize the images in STL-10 to 32 × 32.

Subnetworks For a network f(·; θ) parameterized by θ, a subnetwork is defined as f(·; m ⊙ θ), where m ∈ {0, 1}^{||θ||_0} is a pruning mask for θ ∈ R^{||θ||_0} and ⊙ is the element-wise product. For GANs, two separate masks, m_d and m_g, are needed for the discriminator and the generator. Consequently, a subnetwork of a GAN consists of a sparse generator g(·; m_g ⊙ θ_g) and a sparse discriminator d(·; m_d ⊙ θ_d). Let θ_0 be the initialization weights of model f and θ_t the weights at training step t. Following Frankle et al. (2019), we define a matching subnetwork as a subnetwork f(·; m ⊙ θ), where θ is initialized with θ_t, that can reach performance comparable to the full network within similar training iterations when trained in isolation; a winning ticket is a matching subnetwork with t = 0, i.e., θ initialized with θ_0.

Finding subnetworks Finding GAN subnetworks amounts to finding the two masks m_g and m_d for the generator and the discriminator. We use both an unstructured magnitude method, iterative magnitude pruning (IMP), and a structured method, channel pruning (He et al., 2017), to generate the masks. For unstructured pruning, we follow these steps. After training the full GAN model for N iterations, we globally prune the weights with the lowest magnitudes (Han et al., 2016) to obtain masks m = (m_g, m_d), where the position of a remaining weight in m is set to one and the position of a pruned weight is set to zero.
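As a minimal sketch of the global lowest-magnitude criterion just described (with NumPy arrays standing in for framework tensors, and `global_magnitude_mask` a hypothetical helper name, not the paper's code):

```python
import numpy as np

def global_magnitude_mask(weights, prune_frac):
    """Return binary masks that keep the largest-magnitude weights
    globally across all tensors, pruning the prune_frac fraction with
    the smallest magnitudes (hypothetical helper)."""
    flat = np.concatenate([w.ravel() for w in weights])
    k = int(prune_frac * flat.size)
    # The k-th smallest magnitude becomes the survival threshold;
    # everything strictly below it is pruned (mask entry 0).
    thresh = np.sort(np.abs(flat))[k] if k > 0 else -np.inf
    return [(np.abs(w) >= thresh).astype(np.uint8) for w in weights]
```

In an actual implementation the same mask-building step would run over the concatenated generator and discriminator parameters to produce (m_g, m_d).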
The weights of the sparse generator and the sparse discriminator are then reset to the initial weights of the full network. Previous works have shown that iterative magnitude pruning (IMP) outperforms one-shot pruning. So rather than pruning the network only once to reach the desired sparsity, we prune a fixed fraction of the non-zero parameters and re-train the network, repeating until the target sparsity is met. Details of this algorithm are in Appendix A1.1, Algorithm 1.

As for channel pruning, the first step is also to train the full model. Besides the normal loss function L_GAN, we follow Liu et al. (2017) and apply an ℓ1-norm penalty on the trainable scale parameters γ in the normalization layers to encourage channel-level sparsity: L_cp = ||γ||_1. To prevent the compressed network from behaving severely differently from the original large network, we introduce a distillation loss as in Wang et al. (2020b): L_dist = E_z[dist(g(z; θ_g), g(z; m_g ⊙ θ_g))]. We train the GAN with these two additional losses for N_1 epochs and obtain the sparse networks g(·; m_g ⊙ θ_g) and d(·; m_d ⊙ θ_d). Details of this algorithm are in Appendix A1.1, Algorithm 2.

Evaluation of subnetworks After obtaining the subnetworks g(·; m_g ⊙ θ_g) and d(·; m_d ⊙ θ_d), we test whether they are matching. We reset the weights to those of a specific step i, train the subnetworks for N iterations, and evaluate them using two standard metrics, Inception Score (Salimans et al., 2016) and Fréchet Inception Distance (Heusel et al., 2017).

Other Pruning Methods We compare the size and performance of subnetworks found by IMP with those found by other techniques that compress the network after training to reduce computational cost at inference.
We use a benchmark pruning approach named standard pruning (Chen et al., 2020b; Han et al., 2016), which iteratively prunes the 20% of weights with the lowest magnitudes, trains the network for another N iterations without any rewinding, and repeats until the target sparsity is reached. To verify the statement that iterative magnitude pruning is better than one-shot pruning, we compare IMP_G and IMP_GD with their one-shot counterparts. Additionally, we compare IMP with random pruning techniques to prove its effectiveness: 1) Random Pruning: randomly generate a sparsity mask m′; 2) Random Tickets: rewind the weights to a different random initialization θ′_0.
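The Random Pruning baseline can be sketched as drawing masks with the same per-tensor sparsity as the IMP masks but with randomly placed support (the helper name and NumPy interface are assumptions; Random Tickets simply reuses the IMP masks with a freshly drawn initialization):

```python
import numpy as np

def random_mask_like(masks, seed=0):
    """Random-pruning baseline: for each IMP mask, return a mask with
    the same number of surviving weights but random positions
    (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    out = []
    for m in masks:
        flat = np.zeros(m.size, dtype=np.uint8)
        flat[: int(m.sum())] = 1   # same survivor count as the IMP mask
        rng.shuffle(flat)          # but at random positions
        out.append(flat.reshape(m.shape))
    return out
```

Matching the sparsity level exactly is what makes the comparison in Claim 3 isolate the effect of the mask structure itself.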

4. THE EXISTENCE OF WINNING TICKETS IN GAN

In this section, we validate the existence of winning tickets in GANs with initialization θ_0 := (θ_g0, θ_d0). Specifically, we empirically establish the following four claims:

Claim 1: Iterative magnitude pruning (IMP) finds winning tickets in GANs, g(·; m_g ⊙ θ_g0) and d(·; m_d ⊙ θ_d0); channel pruning is able to find winning tickets as well.

Claim 2: Whether or not the discriminator D is pruned does not change the existence of winning tickets; it is the initialization used in D that matters. Moreover, pruning the discriminator slightly boosts matching subnetworks in terms of extreme sparsity and performance.

Claim 3: IMP finds winning tickets at sparsities where other pruning methods (random pruning, one-shot magnitude pruning, and random tickets) are not matching.

Claim 4: The late rewinding technique (Frankle et al., 2019) helps. Matching subnetworks initialized to θ_i, i.e., i steps from θ_0, can outperform those initialized to θ_0. Moreover, late-rewound matching subnetworks can be trained to match the performance of standard pruning.

Claim 1: Are there winning tickets in GANs? To answer this question, we first conduct experiments on SNGAN by pruning the generator only, in the following steps: 1) run IMP to obtain a sequence of sparsity masks (m_di, m_gi) at sparsity s_i% (i.e., with s_i% of the weights removed); 2) apply the masks to the GAN and reset the weights of the subnetworks to the same random initialization θ_0; 3) train the models to evaluate whether they are winning tickets. We set s_i% = (1 − 0.8^i) × 100%, which we use for all experiments involving iterative pruning hereinafter. The number of training epochs for subnetworks is identical to that for the full models. Figure 1 verifies the existence of winning tickets in SNGAN and CycleGAN.
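The sparsity schedule s_i% = (1 − 0.8^i) × 100% can be computed directly; note that the fifth and sixth IMP rounds land at 67.23% and 73.79%, the extreme sparsities reported for CycleGAN and SNGAN below. A minimal sketch (the function name is ours):

```python
def imp_sparsity_schedule(rounds, rate=0.2):
    """Sparsity after each IMP round when pruning `rate` (20% by
    default) of the remaining weights per round:
    s_i = (1 - (1 - rate)**i) * 100, rounded to two decimals."""
    return [round((1 - (1 - rate) ** i) * 100, 2)
            for i in range(1, rounds + 1)]
```

For example, `imp_sparsity_schedule(6)` yields the sequence 20.0, 36.0, 48.8, 59.04, 67.23, 73.79.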
We are able to find winning tickets by iteratively pruning the generators at sparsities up to around 74% in SNGAN and around 67% in CycleGAN, where the FID scores of these subnetworks match those of the respective full networks. The confidence intervals also suggest that the winning tickets at some sparsities are statistically significantly better than the full model. To show that channel pruning can find winning tickets as well, we extract several subnetworks from the trained full SNGAN and CycleGAN by varying ρ in Algorithm 2. We define a channel-pruned model's sparsity via the ratio of MFLOPs between the sparse model and the full model. We confirm that winning tickets can also be found by channel pruning (CP); CP finds winning tickets in SNGAN at sparsity around 34%. We analyze this more carefully in Section 7.

Claim 2: Does the treatment of the discriminator affect the existence of winning tickets? Previous works on GAN pruning did not analyze the effect of pruning the discriminator. To study it, we compare two iterative pruning settings: 1) prune the generator only (IMP_G) and 2) prune both the generator and the discriminator iteratively (IMP_GD). Both the generator and the discriminator are reset to the same random initialization θ_0 after the masks are obtained. The FID scores of the two experiments are shown in Figure 4. The two settings share similar patterns: the minimal FID of IMP_G is 14.19, and the minimal FID of IMP_GD is 14.59. The difference between these two best FIDs is only 0.4, indicating a negligible difference in generative power. The FID curve of IMP_G lies below that of IMP_GD at low sparsity but above it at high sparsity, indicating that pruning the discriminator produces slightly better performance when the fraction of remaining weights is small. The extreme sparsity at which IMP_GD can match the performance of the full model is 73.8%.
In contrast, IMP_G can only match up to 67.2% sparsity, demonstrating that pruning the discriminator can push the frontier of extreme sparsity at which the pruned models still match. In addition, we study the effect of different initializations for the sparse discriminator. We compare two weight-loading methods during the iterative magnitude pruning process: 1) reset the weights of the generator to θ_g0 and the weights of the discriminator to θ_d0, which is identical to IMP_G; 2) reset the weights of the generator to θ_g0 and fine-tune the discriminator, which we call IMP_G^F. Figure 4 shows that resetting both sets of weights to θ_0 produces a much better result than resetting only the generator. A discriminator D that keeps its trained weights is too strong for a generator G with initial weights θ_g0, which leads to degraded performance. In summary, the initialization of the discriminator significantly influences the existence and quality of winning tickets in GAN models. A follow-up question arises from these observations: is there any way to use the weights of the dense discriminator, given that using them directly yields inferior results? One possibility is to use the dense discriminator as a "teacher" and transfer its knowledge to the pruned discriminator via a consistency loss. Formally, an additional regularization term is used when training the whole network: L_KD(x; θ_d, m_d) = E_x[KL_Div(d(x; m_d ⊙ θ_d), d(x; θ_d1))], where KL_Div denotes the KL-divergence. We name the iterative pruning method with this additional regularization IMP_GD^KD. Figure 5 shows the results of IMP_GD^KD compared to the two pruning methods proposed above, IMP_G and IMP_GD. IMP_GD^KD is capable of finding winning tickets at sparsity around 70%, outperforming IMP_G and showing results comparable to IMP_GD in terms of extreme sparsity.
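To make the consistency term concrete, here is a NumPy sketch of a batch-averaged KL-divergence between the sparse discriminator's outputs and the dense teacher's; treating the raw outputs as logits and normalizing them with a softmax is our assumption, as the paper does not specify the exact parameterization:

```python
import numpy as np

def kd_loss(sparse_logits, dense_logits):
    """Sketch of L_KD: mean KL divergence between the sparse
    (student) discriminator's output distribution and the dense
    (teacher) discriminator's, per batch element.
    Inputs are (batch, classes) arrays of raw scores; softmax
    normalization is an assumption of this sketch."""
    def softmax(x):
        e = np.exp(x - x.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)
    p = softmax(sparse_logits)  # student distribution
    q = softmax(dense_logits)   # teacher distribution
    return float(np.mean(np.sum(p * (np.log(p) - np.log(q)), axis=1)))
```

The term is zero exactly when the pruned discriminator reproduces the dense one, and is simply added to the usual GAN loss during re-training.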
The FID curve of IMP_GD^KD furthermore lies mostly below that of IMP_GD, demonstrating stronger generative ability. This suggests that transferring knowledge from the full discriminator benefits finding the winning tickets.

Claim 3: Can IMP find matching subnetworks sparser than other pruning methods? Previous works claim that both a specific sparsity mask and a specific initialization are necessary for finding winning tickets (Frankle & Carbin, 2019), and that iterative magnitude pruning is better than one-shot pruning. To extend these statements to GANs, we compare IMP with several baselines, random pruning (RP), one-shot magnitude pruning (OMP), and random tickets (RT), to see whether IMP can find matching subnetworks at higher sparsity. Figure 6 and Table 1 show that iterative magnitude pruning outperforms one-shot magnitude pruning regardless of whether the discriminator is pruned. IMP finds winning tickets at higher sparsity (67.23% and 73.79%, respectively) than one-shot pruning (59.00% and 48.80%, respectively). The minimal FID of subnetworks found by IMP is smaller than that of subnetworks found by OMP as well. This observation supports the statement that iterative pruning is superior to one-shot pruning. We also list the minimal FID scores and extreme sparsities of matching subnetworks for the other pruning methods in Figure 7 and Table 2. IMP finds winning tickets at sparsities where random pruning and random initialization cannot match. Since IMP_G shows the best overall result, we confirm the previous statement that both the specific sparsity mask and the specific initialization are essential for finding winning tickets.

Claim 4: Does rewinding improve performance? In the previous paragraphs, we showed that winning tickets can be found in both SNGAN and CycleGAN.
However, these subnetworks cannot match the performance of the original network at extremely high sparsity, while the subnetworks found by standard pruning can (Table 3). To find matching subnetworks at such high sparsity, we adopt the rewinding paradigm: after the masks are obtained, the weights of the model are rewound to θ_i, the weights after i steps of training, rather than reset to the random initialization θ_0. Renda et al. (2020) pointed out that subnetworks found by IMP and rewound to a point early in training can be trained to the same accuracy at the same sparsity as subnetworks found by standard pruning, suggesting that rewinding may also help GAN subnetworks. We choose different rewinding points: 5%, 10%, and 20% of the total training epochs. The results are shown in Table 3. We observe that rewinding significantly increases the extreme sparsity of matching subnetworks. Rewinding to only 5% of the training process raises the extreme sparsity from 67.23% to 86.26%, and rewinding to 20% matches the performance of standard pruning. We also compare the FID of subnetworks found at 89% sparsity: rewinding to 20% of the training process matches standard pruning at 89% sparsity, and the other late rewinding settings match the performance of the full model. This suggests that late rewinding greatly contributes to finding matching subnetworks at higher sparsity.

Summary Extensive experiments examine the existence of matching subnetworks in generative adversarial models. We confirmed that matching subnetworks exist at high sparsities and that both the sparsity mask and the initialization matter for finding winning tickets. We also studied the effect of pruning the discriminator and demonstrated that it can slightly boost performance in terms of extreme sparsity and minimal FID.
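The rewinding mechanics amount to saving a checkpoint θ_i after i = rewind_frac · N steps of full-model training and using it, instead of θ_0, to re-initialize the pruned subnetwork. A minimal sketch, where `state` and `train_step` are hypothetical stand-ins for real parameters and a real optimizer step:

```python
import copy

def snapshot_for_rewind(state, train_step, total_iters, rewind_frac):
    """Train the full model for total_iters steps, saving the weights
    theta_i after rewind_frac * total_iters steps; pruned subnetworks
    are later reset to this snapshot rather than to theta_0.
    `state` is a dict of parameters and `train_step` updates it in
    place (both hypothetical stand-ins for a real training loop)."""
    rewind_at = int(rewind_frac * total_iters)
    snapshot = copy.deepcopy(state)  # theta_0 fallback if rewind_at == 0
    for it in range(1, total_iters + 1):
        train_step(state)
        if it == rewind_at:
            snapshot = copy.deepcopy(state)  # theta_i
    return snapshot
```

With rewind_frac = 0 this degenerates to the original LTH reset to θ_0, so the same loop covers both settings in Table 3.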
We proposed a method that utilizes the weights of the dense discriminator to further boost performance. We also compared IMP with other pruning methods, showing that IMP is superior to random tickets and random pruning. In addition, late rewinding can match the performance of standard pruning, which is again consistent with previous works.

5. THE TRANSFER LEARNING OF GAN MATCHING NETWORKS

In the previous section, we confirmed the presence of winning tickets in GANs. In this section, we study their transferability. Existing works (Mehta, 2019) show that matching subnetworks in discriminative models can transfer across tasks; here we evaluate this claim for GANs. To investigate transferability, we design transfer experiments from CIFAR-10 to STL-10 on SNGAN. We first identify matching subnetworks g(·; m_g ⊙ θ) and d(·; m_d ⊙ θ) on CIFAR-10, and then train and evaluate the subnetworks on STL-10. To assess whether the same random initialization θ_0 is needed for transferring, we test three weight-loading methods: 1) reset the weights to θ_0; 2) reset the weights to a different random initialization θ_r; 3) rewind the weights to θ_N. We train the network on STL-10 using the same hyper-parameters as on CIFAR-10. The hyper-parameter setting might not be optimal for the target task, yet it allows a fair comparison among the transfer settings. The FID scores of the different settings are shown in Table 4. Subnetworks initialized with θ_0 and using masks generated by IMP_G can be trained to achieve results comparable to the baseline model. Surprisingly, random re-initialization θ_r shows better transferability than the same initialization θ_0 in our transfer settings and outperforms the full model trained on STL-10, indicating that the combination of θ_0 and the mask generated by IMP_GD is more specialized to the source dataset and consequently transfers less well.

Summary In this section, we tested the transferability of IMP subnetworks. Transferring from θ_0 and from θ_r both produce matching results on the target dataset, STL-10. θ_0 works better with masks generated by IMP_G, while masks generated by IMP_GD prefer a different initialization θ_r. Given that IMP_GD performs better on CIFAR-10, it is reasonable that the same initialization θ_0 has lower transferability when using masks from IMP_GD.

6. EXPERIMENTS ON OTHER GAN MODELS AND OTHER DATASETS

We conducted experiments on DCGAN (Radford et al., 2016), WGAN-GP (Gulrajani et al., 2017), ACGAN (Odena et al., 2017), GGAN (Lim & Ye, 2017), DiffAugGAN (Zhao et al., 2020a), ProjGAN (Miyato & Koyama, 2018), SAGAN (Zhang et al., 2019), as well as a NAS-based GAN, AutoGAN (Gong et al., 2019). We use CIFAR-10 and Tiny ImageNet (Wu et al., 2017) as our benchmark datasets. Tables 5 and 6 consistently verify the existence of winning tickets in diverse GAN architectures, in spite of the different extreme sparsities, showing that the lottery ticket hypothesis generalizes to various GAN models.

7. EFFICIENCY OF GAN WINNING TICKETS

Unlike unstructured magnitude pruning, channel pruning removes entire channels and thus reduces the actual computation performed by GANs. Therefore, winning tickets found by channel pruning are more efficient than the original model in terms of computational cost. To fully exploit the advantage of subnetworks found by structured pruning, we further compare our prune-and-train pipeline with a state-of-the-art GAN compression framework (Wang et al., 2020b). The pipeline is as follows: after extracting the sparse structure generated by channel pruning, we reset the model weights to the same random initialization θ_0 and then train for the same number of epochs as the dense model.

8. CONCLUSION

In this paper, the lottery ticket hypothesis has been extended to GANs. We successfully identify winning tickets in GANs, which are separately trainable to match the performance of the full dense GAN. Pruning the discriminator, which has rarely been studied before, has only slight effects on the ticket-finding process, while the initialization used in the discriminator is essential. We also demonstrate that the found winning tickets transfer across diverse tasks. Moreover, we provide a new way of finding winning tickets that alters the structure of models: channel pruning can extract matching subnetworks from a dense model that outperform the current state-of-the-art GAN compression after weight resetting and re-training.

ACKNOWLEDGEMENT

Zhenyu Zhang is supported by the National Natural Science Foundation of China under grant No. U19B2044.



The detailed descriptions of the algorithms are listed in Appendix A1.1.



Figure 1: The Fréchet Inception Distance (FID) curve of subnetworks of SNGAN (left) and CycleGAN (right) generated by iterative magnitude pruning (IMP) on CIFAR-10 and horse2zebra. The dashed line indicates the FID score of the full model on CIFAR-10 and horse2zebra. The 95% confidence interval of 5 runs is reported.

Figure 2: Visualization by sampling and interpolation of SNGAN winning tickets found by IMP. Sparsity of best winning tickets: 48.80%. Extreme sparsity of matching subnetworks: 73.79%. (Panel labels: Source Image; Full Model, sparsity 0%; Best Winning Tickets, sparsity 59.04%; Matching Subnetworks, sparsity 67.24%.)

Figure 4: Left: The FID scores of the best subnetworks generated by two pruning settings, IMP_G and IMP_GD. Right: The FID scores of the best subnetworks generated by two pruning settings, IMP_G and IMP_G^F. IMP_G: iteratively prune and reset the generator. IMP_GD: iteratively prune and reset both the generator and the discriminator. IMP_G^F: iteratively prune and reset the generator, and iteratively prune but do not reset the discriminator.

Figure 5: The FID curves of the best subnetworks generated by three pruning methods: IMP_G, IMP_GD, and IMP_GD^KD. IMP_GD^KD: iteratively prune and reset both the generator and the discriminator, and train them with the KD regularization.

Figure 6: FID curves of OMP_G, IMP_G, OMP_GD, and IMP_GD. OMP_G: one-shot prune the generator. OMP_GD: one-shot prune the generator and the discriminator.

Figure 7: The FID curves of the best subnetworks generated by three pruning settings: IMP_G, RP, and RT. RP: iteratively randomly prune the generator. RT: iteratively prune the generator but reset the weights to a different random initialization.

Figure 8: Relationship between the best IS of SNGAN subnetworks generated by channel pruning and the percent of remaining weights. GS-32: GAN Slimming without quantization (Wang et al., 2020b). Full Model: the full model trained on CIFAR-10.

SNGAN with a ResNet (He et al., 2016) backbone is one of the most popular noise-to-image GANs and performs strongly on several datasets such as CIFAR-10. CycleGAN is a popular and well-studied image-to-image GAN that also performs well on several benchmarks. For SNGAN, let g(z; θ_g) be the output of the generator network G with parameters θ_g and latent variable z, and let d(x; θ_d) be the output of the discriminator network D with parameters θ_d and input example x. For CycleGAN, which is composed of two generator-discriminator pairs, we reuse g(x; θ_g) and θ_g to represent the outputs and weights of the two generators, where x = (x_1, x_2) denotes a pair of input examples. The same convention applies to the two discriminators in CycleGAN.

The extreme sparsity of the matching networks.

The FID scores of the best subnetworks and the extreme sparsity of matching networks found by Random Pruning, Random Tickets, and iterative magnitude pruning.

Rewinding results. S_Extreme: extreme sparsity at which matching subnetworks exist. FID_Best: the minimal FID score of all subnetworks. FID_89%: the FID score of subnetworks at 89% sparsity. (Table columns: setting of rewinding, S_Extreme, FID_Best, FID_89%.)

Results of transfer experiments. θ_0: train the target model from the same random initialization as the source model; θ_r: train from a different random initialization; θ_Best: train from the weights of the trained source model. Baseline: the full model trained on STL-10. (Table columns: Model — Baseline, IMP_G (S = 67.23%), IMP_GD (S = 73.79%), IMP_GD^KD (S = 73.79%); metrics — FID_Best and whether matching.)

Results on other GAN models on CIFAR-10. FID_Full: FID score of the full model. FID_Best: the minimal FID score of all subnetworks. FID_Extreme: the FID score of matching networks at the extreme sparsity level. AutoGAN-A/B/C are three representative GAN architectures provided in the official repository (https://github.com/VITA-Group/AutoGAN).

Results on other GAN models on Tiny ImageNet. FID_Full: FID score of the full model. FID_Best: the minimal FID score of all subnetworks. FID_Extreme: the FID score of matching networks at the extreme sparsity level.

We can see from Figure 8 that subnetworks are found by the channel pruning method at about 67% sparsity, which provides a new path to winning tickets other than magnitude pruning. The matching networks at 67.7% sparsity outperform GS-32 in Inception Score by 0.25; the subnetworks at about 29% sparsity outperform GS-32 by 0.20, setting a new benchmark for GAN compression.

A1 MORE TECHNICAL DETAILS A1.1 ALGORITHMS

In this section, we describe the details of the algorithms we used to find lottery tickets. Two distinct pruning methods are presented in Algorithm 1 and Algorithm 2.

Table A7: The F_8 and F_1/8 scores of the full network, the best subnetworks, and the matching networks at extreme sparsity. We used the official code to calculate recall, precision, F_8, and F_1/8.

A2 MORE EXPERIMENTS RESULTS AND ANALYSIS

We provide extra experimental results and analysis in this section.

A2.1 MORE VISUALIZATION OF IMP WINNING TICKETS

We also conducted experiments to find winning tickets in CycleGAN on the winter2summer dataset (Zhu et al., 2017). We observed similar patterns and found matching networks at 79.02% sparsity. We randomly sample four images from the dataset and show the translated images in Figure A10, Figure A11, and Figure A12. The winning tickets of CycleGAN generate visual quality comparable to the full model in all cases.

A2.3 CHANNEL PRUNING FOR SNGAN

We study the relationship between the Inception Score and the remaining model size, i.e., the ratio between the size of a channel-pruned model and its original model. The results are plotted in Figure A13. A similar conclusion can be drawn from the graph: matching networks exist, and at the same sparsity they can be trained to outperform the current state-of-the-art GAN compression framework.

A2.4 CHANNEL PRUNING FOR CYCLEGAN

We also conducted experiments on CycleGAN using channel pruning. The task we choose is horse-to-zebra, and we prune each of the two generators separately, aligning the setting with SNGAN, which has only one generator. Figure A14 shows that channel pruning is also capable of finding winning tickets in CycleGAN. Moreover, at extreme sparsity, the sparse subnetworks we obtain can be trained to reach slightly better results than the current state-of-the-art GAN compression framework without quantization.

