TOWARDS FASTER AND STABILIZED GAN TRAINING FOR HIGH-FIDELITY FEW-SHOT IMAGE SYNTHESIS

Abstract

Training Generative Adversarial Networks (GANs) on high-fidelity images usually requires large-scale GPU clusters and a vast number of training images. In this paper, we study the few-shot image synthesis task for GANs with minimum computing cost. We propose a lightweight GAN structure that gains superior quality at 1024 × 1024 resolution. Notably, the model converges from scratch with just a few hours of training on a single RTX-2080 GPU, and performs consistently even with fewer than 100 training samples. Two techniques constitute our work: a skip-layer channel-wise excitation module and a self-supervised discriminator trained as a feature-encoder. On thirteen datasets covering a wide variety of image domains¹, we show our model's superior performance compared to the state-of-the-art StyleGAN2 when data and computing budget are limited.

1. INTRODUCTION

The fascinating ability of state-of-the-art (SOTA) Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) to synthesize images displays great potential for many intriguing real-life applications, such as image translation, photo editing, and artistic creation. However, the expensive computing cost and the vast amount of required training data limit these SOTA models in real applications with only small image sets and low computing budgets. In real-life scenarios, the samples available to train a GAN can be minimal, e.g., the medical images of a rare disease, a particular celebrity's portrait set, or a specific artist's artworks.

Transfer-learning with a pre-trained model (Mo et al., 2020; Wang et al., 2020) is one solution to the lack of training images. Nevertheless, there is no guarantee of finding a compatible pre-training dataset, and fine-tuning on an incompatible one can lead to even worse performance (Zhao et al., 2020). A recent study highlighted that in art-creation applications, most artists prefer to train their models from scratch on their own images to avoid the biases of a fine-tuned pre-trained model, and that in most cases artists want to train their models with datasets of fewer than 100 images (Elgammal et al., 2020). Dynamic data augmentation (Karras et al., 2020a; Zhao et al., 2020) narrows the gap and stabilizes GAN training with fewer images. However, the computing cost of SOTA models such as StyleGAN2 (Karras et al., 2020b) and BigGAN (Brock et al., 2019) remains high, especially when training at 1024 × 1024 resolution.

In this paper, our goal is to learn an unconditional GAN on high-resolution images with low computational cost and few training samples. As summarized in Fig. 2, these training conditions expose the model to a high risk of overfitting and mode-collapse (Arjovsky & Bottou, 2017; Zhang & Khoreva, 2018).
To train a GAN under these demanding conditions, we need a generator (G) that can learn fast, and a discriminator (D) that can continuously provide useful signals to train G. To address these challenges, we summarize our contributions as:

• We design the Skip-Layer channel-wise Excitation (SLE) module, which leverages low-scale activations to revise the channel responses on high-scale feature maps. SLE allows a more robust gradient flow throughout the model weights for faster training. It also leads to an automated learning of a style/content disentanglement similar to StyleGAN2.

• We propose a self-supervised discriminator D trained as a feature-encoder with an extra decoder. We force D to learn a more descriptive feature map covering more regions of an input image, thus yielding more comprehensive signals to train G. Among the multiple self-supervision strategies we test for D, auto-encoding works the best.

• We build a computationally efficient GAN model based on the two proposed techniques, and show the model's robustness on multiple high-fidelity datasets, as demonstrated in Fig. 1.
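To make the SLE idea concrete, the following is a minimal PyTorch sketch of a skip-layer excitation block, written from the description above; the class name, channel sizes, and the 4 × 4 pooling target are our illustrative choices, not a verbatim copy of the paper's implementation. A low-resolution feature map is squeezed to a per-channel gate in (0, 1), which then rescales the channels of a high-resolution feature map:

```python
import torch
import torch.nn as nn

class SLE(nn.Module):
    """Skip-Layer channel-wise Excitation (illustrative sketch).

    Pools a low-scale feature map to 4x4, maps it to a per-channel gate
    via two convolutions, and multiplies the gate into the high-scale
    feature map channel-wise.
    """
    def __init__(self, ch_low, ch_high):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(4),        # squeeze spatial dims to 4x4
            nn.Conv2d(ch_low, ch_low, 4),   # 4x4 conv -> 1x1 spatial
            nn.LeakyReLU(0.1),
            nn.Conv2d(ch_low, ch_high, 1),  # match high-scale channel count
            nn.Sigmoid(),                   # per-channel gate in (0, 1)
        )

    def forward(self, feat_low, feat_high):
        # Gate shape (N, ch_high, 1, 1) broadcasts over H x W of feat_high.
        return feat_high * self.gate(feat_low)

# Example: gate a 128x128 feature map with an 8x8 one.
low = torch.randn(1, 512, 8, 8)
high = torch.randn(1, 64, 128, 128)
out = SLE(512, 64)(low, high)
```

Because the skip connection only rescales channels, it adds a direct gradient path from deep, low-resolution layers to shallow, high-resolution ones at negligible extra cost.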

2. RELATED WORKS

Stabilize the GAN training: Mode-collapse on G is one of the major challenges when training GANs, and it becomes even harder given fewer training samples and a lower computational budget (a smaller batch size), since D is more likely to overfit the dataset and thus becomes unable to provide meaningful gradients to train G (Gulrajani et al., 2017). Prior works tackle the overfitting issue by seeking a good regularization for D, including different objectives (Arjovsky et al., 2017; Lim & Ye, 2017; Tran et al., 2017); regularizing the gradients (Gulrajani et al., 2017; Mescheder et al., 2018); normalizing the model weights (Miyato et al., 2018); and augmenting the training data (Karras et al., 2020a; Zhao et al., 2020). However, the effect of these methods degrades quickly when the training batch size is limited, since appropriate batch statistics can hardly be computed for the regularization (normalization) over the training iterations.
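The self-supervised discriminator introduced earlier is this paper's alternative to such regularization: an extra decoder reconstructs the input image from D's feature map, and the reconstruction error is added to D's loss so its features must describe the whole image. The sketch below is a deliberately tiny stand-in (the network sizes and the plain MSE loss are our simplifications; the paper processes the reconstruction targets and may use a different reconstruction objective):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyEncoderDecoder(nn.Module):
    """Illustrative auto-encoding discriminator: encoder stands in for D's
    backbone; the decoder is an extra branch used only during training."""
    def __init__(self):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.1),
        )
        self.decode = nn.Sequential(
            nn.Upsample(scale_factor=2), nn.Conv2d(32, 16, 3, padding=1),
            nn.LeakyReLU(0.1),
            nn.Upsample(scale_factor=2), nn.Conv2d(16, 3, 3, padding=1),
            nn.Tanh(),  # reconstructions in [-1, 1], like the inputs
        )

    def forward(self, x):
        feat = self.encode(x)
        return feat, self.decode(feat)

real = torch.rand(2, 3, 64, 64) * 2 - 1        # fake batch of "real" images
feat, recon = TinyEncoderDecoder()(real)
# Reconstruction loss on real images, added to D's adversarial loss.
recon_loss = F.mse_loss(recon, real)
```

Unlike batch-statistic-based regularizers, this auto-encoding objective is computed per sample, so it keeps working at very small batch sizes.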



¹The datasets and code are available at: https://github.com/odegeasslbc/FastGAN-pytorch



Figure 1: Synthetic results at 1024² resolution from our model, trained from scratch on a single RTX 2080-Ti GPU with only 1000 images. Left: 20 hours on nature photos; Right: 10 hours on FFHQ.

Figure 2: The causes and challenges for training GAN in our studied conditions.

Speed up the GAN training: Speeding up GAN training has been approached from various perspectives. Ngxande et al. reduce the computing time with depth-wise convolutions. Zhong et al. adjust the GAN objective into a min-max-min problem with a shorter optimization path. Sinha et al. prepare each batch of training samples via coreset selection, leveraging better data preparation for faster convergence. However, these methods bring only limited improvements in training speed, and the synthesis quality is not advanced within the shortened training time.

Train GAN on high resolution: High-resolution GAN training is problematic for two reasons. First, the increased number of model parameters makes the gradient flow to optimize G more rigid. Second, the target distribution formed by images at 1024 × 1024 resolution is extremely sparse, making the GAN much harder to converge. Denton et al. (2015); Zhang et al. (2017); Huang et al. (2017); Wang et al. (2018); Karras et al. (2019); Karnewar & Wang (2020); Karras et al. (2020b); Liu et al. (2021) develop multi-scale GAN structures to alleviate the gradient-flow issue, where G outputs images and receives feedback at several resolutions simultaneously. However, all these approaches further increase the computational cost, consuming even more GPU memory and training time.

