DOUBLE DYNAMIC SPARSE TRAINING FOR GANS

Abstract

The past decade has witnessed a drastic increase in the size of modern deep neural networks (DNNs), including generative adversarial networks (GANs). Since GANs usually suffer from high computational complexity, researchers have shown increased interest in applying pruning methods to reduce their training and inference costs. Among the pruning methods developed for supervised learning, dynamic sparse training (DST) has recently gained increasing attention, as it enjoys excellent training efficiency with performance comparable to post-hoc pruning. Applying DST to GANs, i.e., training a sparse GAN with a fixed parameter count throughout training, therefore seems a promising way to reduce GAN training costs. However, the adversarial nature of GANs raises new challenges, including aggravated training instability. We therefore introduce a quantity called the balance ratio (BR) to quantify the balance between the generator and the discriminator, and we conduct a series of experiments demonstrating the importance of BR for understanding sparse GAN training. Building upon single dynamic sparse training (SDST), where only the generator is adjusted during training, we propose double dynamic sparse training (DDST) to control the BR during GAN training. Empirically, DDST automatically determines the density of the discriminator and greatly boosts the performance of sparse GANs on multiple datasets.

1. INTRODUCTION

In the past decade, the training and inference costs of modern deep neural networks (DNNs) have gradually become prohibitive (He et al., 2016; Dosovitskiy et al., 2020; Liu et al., 2021d), especially for large language models (Brown et al., 2020). Among these large models, generative adversarial networks (GANs) (Goodfellow et al., 2020) have been widely investigated for years and have achieved remarkable results. However, like other giant models, GANs are notably computationally intensive: for example, training BigGAN (Brock et al., 2018) on 8 NVIDIA V100 GPUs at full precision takes 15 days. Consequently, to train GANs in broader resource-constrained scenarios, this computational bottleneck needs to be resolved urgently.

Neural network pruning has recently emerged as a powerful tool to reduce the training and inference costs of DNNs in supervised learning. There are three main genres of pruning methods: pruning-at-initialization, pruning-during-training, and post-hoc pruning. Post-hoc pruning (Janowsky, 1989; LeCun et al., 1989; Han et al., 2015) dates back to the 1980s and was first introduced to reduce inference time and memory requirements; it therefore does not align with our goal of efficient training. Later, pruning-at-initialization (Lee et al., 2018; Wang et al., 2020a; Tanaka et al., 2020) and pruning-during-training methods (Louizos et al., 2017; Wen et al., 2016) were introduced to prune networks before training and throughout training, respectively. Most early pruning-during-training algorithms (Savarese et al., 2020) gradually decrease the density of the network and hence bring little training efficiency compared to post-hoc pruning.
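To make the distinction concrete, unstructured post-hoc pruning at its simplest keeps only the largest-magnitude weights of a trained layer. The sketch below is our own illustration of that baseline, not code from any of the cited works:

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, density: float) -> np.ndarray:
    """Return a binary mask keeping the top `density` fraction of weights
    ranked by absolute magnitude (the classic post-hoc criterion)."""
    k = int(round(density * weights.size))
    if k == 0:
        return np.zeros_like(weights, dtype=bool)
    # threshold at the k-th largest absolute value
    thresh = np.sort(np.abs(weights).ravel())[-k]
    return np.abs(weights) >= thresh

w = np.random.randn(64, 64)          # stand-in for trained weights
mask = magnitude_prune(w, density=0.1)  # roughly 10% of weights survive
```

In the train-prune-retrain pipeline, the surviving weights `w * mask` are then fine-tuned for one or more additional rounds, which is precisely the retraining cost that makes post-hoc pruning unsuitable for efficient training.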
However, recent advances in dynamic sparse training (DST) (Evci et al., 2020; Liu et al., 2021a;b;c; Mocanu et al., 2018) show for the first time that pruning-during-training methods can match the training FLOPs of pruning-at-initialization methods while achieving performance competitive with post-hoc pruning. Applying DST to GANs therefore seems a promising choice. Although DST has attained remarkable achievements in supervised learning, its application to GANs is less explored due to newly emerged challenges. The main difficulty stems from the fact that GAN training is notoriously brittle: successful training usually requires carefully chosen architectures and finely tuned hyper-parameters. One possible cause is the difficulty of balancing the generator and the discriminator throughout training (Bai et al., 2018; Arora et al., 2017). Specifically, an overly strong discriminator leads to overfitting, while a weak discriminator results in mode collapse. The requirement of balanced training consequently makes sparse GAN training even more challenging. On the one hand, we find that the performance degradation caused by the imbalance of GANs is even more severe once sparsity is introduced. On the other hand, when directly applying DST to the generator (or to both components), as in the pioneering work STU-GAN (Liu et al., 2022), it is unclear how to determine a reasonable density for the discriminator. To this end, we propose a metric called the balance ratio (BR), which measures the degree of balance between the two components, to study sparse GAN training. We find that BR is useful for (1) understanding the interaction between the discriminator and the generator, (2) identifying the cause of training failures, and (3) serving as an indicator that helps stabilize sparse GAN training. To the best of our knowledge, this is the first study to investigate the balance of sparse GANs, and it may also provide new insights into dense GAN training.
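A typical DST iteration, in the style of RigL (Evci et al., 2020), keeps the parameter count fixed by pairing magnitude-based pruning with gradient-based regrowth. The following NumPy sketch is our own illustration of that generic prune-and-grow step, not the update rule used in this paper:

```python
import numpy as np

def dst_update(weights, mask, grads, update_frac=0.3):
    """One RigL-style prune-and-grow step: drop the smallest-magnitude active
    weights, then regrow the same number of previously inactive connections
    with the largest gradient magnitude. Total density is unchanged."""
    n_swap = int(update_frac * mask.sum())
    flat_mask = mask.ravel().copy()
    # prune: among active weights, drop the n_swap with the smallest |w|
    active_mag = np.where(flat_mask, np.abs(weights).ravel(), np.inf)
    flat_mask[np.argsort(active_mag)[:n_swap]] = False
    # grow: among originally inactive positions, enable the n_swap
    # with the largest gradient magnitude |dL/dw|
    inactive_grad = np.where(mask.ravel(), -np.inf, np.abs(grads).ravel())
    flat_mask[np.argsort(inactive_grad)[-n_swap:]] = True
    return flat_mask.reshape(mask.shape)
```

Because every drop is matched by a grow, the sparse network explores different connectivity patterns over training while its FLOP budget stays roughly constant, which is what lets DST compete with pruning-at-initialization on training cost.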
Using BR as an indicator, we further propose double dynamic sparse training (DDST) to automatically adjust the density and the connections of the discriminator during training. Our contributions are summarized below:

• We introduce a quantity named the balance ratio to quantify the degree of balance in GAN training, which also helps explain some training failure cases.

• We first consider single dynamic sparse training (SDST), a generalization of STU-GAN (Liu et al., 2022) that applies DST to only the generator under varying discriminator density ratios. We show that SDST does not necessarily outperform static sparse training baselines.

• We provide two strategies to determine the discriminator density for SDST, and we find that a relatively larger density usually yields stable and better performance.

• Using the balance ratio as an indicator, we propose double dynamic sparse training (DDST), which makes the discriminator dynamic at both the density level and the parameter level. Empirically, DDST outperforms baselines with reasonable computational cost on several datasets.
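The precise definition of the balance ratio is given later in the paper. Purely as an illustration of the kind of quantity involved, one plausible form measures how much of the discriminator's real/fake score gap a single generator update closes; the function below is our hypothetical sketch, not the paper's definition:

```python
import numpy as np

def balance_ratio(d_real, d_fake_before, d_fake_after):
    """Illustrative balance measure: the fraction of the real/fake
    discriminator-score gap closed by one generator update.
    Values near 0 suggest the discriminator dominates (risk of overfitting);
    values near 1 suggest a weak discriminator (risk of mode collapse).
    Hypothetical form for illustration only."""
    gap = float(np.mean(d_real) - np.mean(d_fake_before))
    closed = float(np.mean(d_fake_after) - np.mean(d_fake_before))
    return closed / gap
```

Whatever the exact formula, the point is that such a scalar can be logged every few iterations and used as a control signal, e.g. to trigger a change in discriminator density when training drifts toward either failure mode.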

2. RELATED WORK

2.1. NEURAL NETWORK PRUNING

Based on the smallest granularity of the pruned units, neural network pruning can be categorized into structured pruning (Liu et al., 2017; 2018; Huang & Wang, 2018; Luo et al., 2017) and unstructured pruning (Frankle & Carbin, 2018; Han et al., 2015). In this work, we mainly focus on unstructured pruning, where individual weights are the finest resolution.

Post-hoc pruning. Post-hoc pruning removes weights from a fully trained neural network and usually incurs high computational cost due to multiple rounds of the train-prune-retrain procedure (Han et al., 2015; Renda et al., 2020). Some methods use specific criteria to select which weights to prune (Han et al., 2015; LeCun et al., 1989; Hassibi et al., 1993; Molchanov et al., 2019; Dai et al., 2018; Guo et al., 2016; Dong et al., 2017; Yu et al., 2018).



Pruning-at-initialization methods. SNIP (Lee et al., 2018) is one of the pioneering works that aim to find trainable sub-networks without any training. Several follow-up works (Wang et al., 2020a; Tanaka et al., 2020; de Jorge et al., 2020; Alizadeh et al., 2022) propose different metrics for pruning networks at initialization. Among them, SynFlow (Tanaka et al., 2020), SPP (Lee et al., 2019), and FORCE (de Jorge et al., 2020) try to address the problem of layer collapse during pruning. Neural tangent transfer (Liu & Zenke, 2020) learns a sub-network by aligning its empirical neural tangent kernel and network output to those of the dense counterpart.
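As an example of such a metric, SNIP ranks connections by the saliency s_j = |w_j * dL/dw_j| computed at initialization. The sketch below is a simplified single-layer version of that criterion (the original computes the gradients once on a mini-batch and prunes all layers jointly):

```python
import numpy as np

def snip_mask(weights, grads, density):
    """Keep the top `density` fraction of connections ranked by the SNIP
    connection sensitivity s_j = |w_j * dL/dw_j| (scores are normalized so
    each value reflects a connection's relative importance)."""
    scores = np.abs(weights * grads)
    scores = scores / scores.sum()
    k = int(round(density * scores.size))
    if k == 0:
        return np.zeros_like(scores, dtype=bool)
    thresh = np.sort(scores.ravel())[-k]
    return scores >= thresh
```

Unlike magnitude pruning, the criterion multiplies each weight by its gradient, so a small weight whose removal would strongly affect the loss can still survive; this is what allows the mask to be chosen before any training occurs.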

