DOUBLE DYNAMIC SPARSE TRAINING FOR GANS

Abstract

The past decade has witnessed a drastic increase in the size of modern deep neural networks (DNNs), especially generative adversarial networks (GANs). Since GANs usually suffer from high computational complexity, researchers have shown an increased interest in applying pruning methods to reduce the training and inference costs of GANs. Among the pruning methods developed for supervised learning, dynamic sparse training (DST) has gained increasing attention recently as it enjoys excellent training efficiency with performance comparable to post-hoc pruning. Hence, applying DST to GANs, i.e., training a sparse GAN with a fixed parameter count throughout training, seems to be a good candidate for reducing GAN training costs. However, several challenges, including degraded training stability, emerge due to the adversarial nature of GANs. To address this, we introduce a quantity called the balance ratio (BR) to quantify the balance between the generator and the discriminator. We conduct a series of experiments to show the importance of BR in understanding sparse GAN training. Building upon single dynamic sparse training (SDST), where only the generator is adjusted during training, we propose double dynamic sparse training (DDST) to control the BR during GAN training. Empirically, DDST automatically determines the density of the discriminator and greatly boosts the performance of sparse GANs on multiple datasets.

1. INTRODUCTION

In the past decade, the training and inference costs of modern deep neural networks (DNNs) have gradually become prohibitive (He et al., 2016; Dosovitskiy et al., 2020; Liu et al., 2021d), especially for large language models (Brown et al., 2020). Among these large models, generative adversarial networks (GANs) (Goodfellow et al., 2020) have been widely investigated for years and have achieved remarkable results. However, similar to other giant models, GANs are notably computationally intensive. For example, BigGAN (Brock et al., 2018) takes 15 days to train on 8 NVIDIA V100 GPUs with full precision. Consequently, to train GANs in broader resource-constrained scenarios, this computational bottleneck of training urgently needs to be resolved. Neural network pruning has recently emerged as a powerful tool to reduce the training and inference costs of DNNs for supervised learning. There are mainly three genres of pruning methods, namely pruning-at-initialization, pruning-during-training, and post-hoc pruning. Post-hoc pruning (Janowsky, 1989; LeCun et al., 1989; Han et al., 2015) dates back to the 1980s and was first introduced to reduce inference time and memory requirements; hence it does not align with our goal of efficient training. Later, pruning-at-initialization (Lee et al., 2018; Wang et al., 2020a; Tanaka et al., 2020) and pruning-during-training methods (Louizos et al., 2017; Wen et al., 2016) were introduced to prune networks before training and throughout training, respectively. Most early pruning-during-training algorithms (Savarese et al., 2020) gradually decrease the density of the neural network and hence offer little training-efficiency gain over post-hoc pruning.
However, recent advances in dynamic sparse training (DST) (Evci et al., 2020; Liu et al., 2021a; b; c; Mocanu et al., 2018) show for the first time that pruning-during-training methods can match the training FLOPs of pruning-at-initialization methods while achieving performance competitive with post-hoc pruning. Therefore, applying DST to GANs seems to be a promising choice. Although DST has attained remarkable achievements in supervised learning, its application to GANs is less explored due to newly emerged challenges. The main difficulty stems from the fact that the training procedure of GANs is notoriously brittle. To ensure successful training, we
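To make the DST mechanism concrete, the following is a minimal, illustrative NumPy sketch of one SET-style prune-and-regrow step (in the spirit of Mocanu et al., 2018): the smallest-magnitude active weights are dropped and the same number of connections are regrown at random inactive positions, so the parameter count stays fixed throughout training. The function name `dst_update` and the `update_frac` parameter are our own illustrative choices, not names from any cited method.

```python
import numpy as np

def dst_update(weights, mask, update_frac=0.3, rng=None):
    """One illustrative prune-and-regrow step on a contiguous weight array.

    Prunes the `update_frac` fraction of active weights with the smallest
    magnitude, then regrows an equal number of connections at random
    inactive positions, keeping the total density constant.
    """
    rng = np.random.default_rng(rng)
    active = np.flatnonzero(mask)          # indices of current connections
    inactive = np.flatnonzero(mask == 0)   # candidate positions for regrowth
    k = int(update_frac * active.size)
    if k == 0 or inactive.size == 0:
        return weights, mask

    # Prune: deactivate the k active weights with the smallest magnitude.
    mags = np.abs(weights.ravel()[active])
    drop = active[np.argsort(mags)[:k]]
    mask.ravel()[drop] = 0
    weights.ravel()[drop] = 0.0

    # Regrow: activate k previously-inactive positions, initialized at zero.
    grow = rng.choice(inactive, size=min(k, inactive.size), replace=False)
    mask.ravel()[grow] = 1
    return weights, mask
```

In an actual DST run this step would be applied periodically (e.g., every few hundred iterations) to each sparse layer between gradient updates; gradient-based regrowth criteria such as the one in RigL (Evci et al., 2020) replace the random `grow` selection but keep the same fixed-density structure.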

