NOT-SO-BIG-GAN: GENERATING HIGH-FIDELITY IMAGES ON SMALL COMPUTE WITH WAVELET-BASED SUPER-RESOLUTION

Abstract

State-of-the-art models for high-resolution image generation, such as BigGAN and VQVAE-2, require an enormous amount of compute resources and/or time (512 TPU-v3 cores) to train, putting them out of reach for the larger research community. On the other hand, GAN-based image super-resolution models, such as ESRGAN, not only upscale images to high resolutions but are also efficient to train. In this paper, we present NOT-SO-BIG-GAN (NSB-GAN), a simple yet cost-effective two-step training framework for deep generative models (DGMs) of high-dimensional natural images. First, we generate images in low-frequency bands by training a sampler in the wavelet domain. Then, we super-resolve these images from the wavelet domain back to pixel space with our novel wavelet super-resolution decoder network. Wavelet-based down-sampling preserves more structural information than pixel-based methods, leading to significantly better generative quality from the low-resolution sampler (e.g., 64×64). Since the sampler and decoder can be trained in parallel and operate on much lower-dimensional spaces than end-to-end models, the training cost is substantially reduced. On ImageNet 512×512, our model achieves a Fréchet Inception Distance (FID) of 10.59, beating the baseline BigGAN model, at half the compute (256 TPU-v3 cores).

1. INTRODUCTION

Generative modeling of natural images has achieved great success in recent years (Kingma & Welling, 2013; Goodfellow et al., 2014; Arjovsky et al., 2017; Menick & Kalchbrenner, 2019; Zhang et al., 2018a). Advancements in scalable computing and in the theoretical understanding of generative models (Miyato et al., 2018; Zhang et al., 2018a; Gulrajani et al., 2017; Mescheder et al., 2018; 2017; Roth et al., 2017; Nowozin et al., 2016; Srivastava et al., 2017; 2020; Karras et al., 2020) have, for the first time, enabled state-of-the-art techniques to generate photo-realistic images in higher dimensions than ever before (Brock et al., 2018; Razavi et al., 2019; Karras et al., 2020). Yet, generating high-dimensional complex data, such as ImageNet, remains challenging and extremely resource intensive. At the forefront of high-resolution image generation is BigGAN (Brock et al., 2018), a generative adversarial network (GAN) (Goodfellow et al., 2014) that tackles the curse of dimensionality (CoD) head-on using the latest in scalable GPU computing. This allows BigGAN to be trained with large mini-batch sizes (e.g., 2048), which greatly helps to model highly diverse, large-scale datasets like ImageNet. But BigGAN's ability to scale to high-dimensional data comes at the cost of a hefty compute budget: a standard BigGAN model at 256×256 resolution can require a month or more of training time on as many as eight Tesla V100 graphics processing units (GPUs). This compute requirement raises the barrier to entry for using and improving upon these technologies, as much of the wider research community does not have access to specialized hardware (e.g., Tensor Processing Units (TPUs) (Jouppi et al., 2017)). The environmental impact of training large-scale models can also be substantial: training BigGAN on 512×512 images with 512 TPU cores for two days reportedly used as much electricity as the average American household does in about six months (Schwab, 2018).
Motivated by these problems, we present NOT-SO-BIG-GAN (NSB-GAN), a small-compute training alternative to BigGAN for class-conditional modeling of high-resolution images. In end-to-end generative models of high-dimensional data, such as VQVAE-2 (Razavi et al., 2019) and Karras et al. (2017), the lower layers transform noise into low-resolution images, which are subsequently upscaled, i.e., super-resolved, to higher dimensions in the higher layers. Based on this insight, in NSB-GAN we propose to split the end-to-end generative model into two separate neural networks, a sampler and an up-scaling decoder, that can be trained in parallel on much smaller dimensional spaces. In turn, we drastically reduce the compute budget of training. This split allows the sampler to be trained in up to 16-times lower-dimensional space, not only making it compute efficient, but also alleviating the training instability of end-to-end approaches. To this end, we propose wavelet-space training of GANs. Compared to pixel-based interpolation methods for down-sampling images, wavelet-transform (WT) (Haar, 1909; Daubechies, 1992; Antonini et al., 1992) based down-sampling preserves much more structural information, leading to much better samplers at fairly low resolutions (Sekar et al., 2014). When applied to a 2D image, the wavelet transform slices the image into four equally-sized image-like patches along different frequency bands. This process can be applied recursively in order to slice a large image into multiple smaller images, each representing the entire image in a different bandwidth. This is diagrammatically shown in Figure 1. Here, the top-left patch (TL) lies in the lowest frequency band and contains most of the structure of the original image; it is therefore the only patch preserved during down-sampling. The highly sparse top-right (TR), bottom-left (BL) and bottom-right (BR) patches lie in higher frequency bands and are therefore dropped.
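As an illustration of this slicing, one level of the Haar wavelet transform (the simplest of the wavelets cited above) can be written in a few lines of NumPy. This is a minimal sketch, not the paper's implementation, and `haar2d` is a hypothetical helper name:

```python
import numpy as np

def haar2d(img):
    """One level of the 2D Haar wavelet transform.

    Splits an image (H x W, with H and W even) into four
    half-resolution patches: TL, TR, BL and BR.
    """
    s = np.sqrt(2.0)
    # Low-/high-pass filter plus downsampling along rows.
    lo = (img[:, ::2] + img[:, 1::2]) / s
    hi = (img[:, ::2] - img[:, 1::2]) / s
    # Repeat along columns to obtain the four frequency bands.
    tl = (lo[::2, :] + lo[1::2, :]) / s  # coarse structure (kept)
    bl = (lo[::2, :] - lo[1::2, :]) / s  # detail band (dropped)
    tr = (hi[::2, :] + hi[1::2, :]) / s  # detail band (dropped)
    br = (hi[::2, :] - hi[1::2, :]) / s  # detail band (dropped)
    return tl, tr, bl, br

# Recursive application: transforming TL again halves each side once
# more, so two levels turn a 256x256 image into a 64x64 TL patch.
img = np.random.rand(256, 256)
tl, _, _, _ = haar2d(img)        # 128 x 128
tl2, _, _, _ = haar2d(tl)        # 64 x 64
```

Note that on smooth image regions the three detail bands are near zero, which is exactly the sparsity that justifies dropping TR, BL and BR.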
But wavelet-space sampling prohibits the use of existing pixel-space super-resolution models, such as Ledig et al. (2017) and Wang et al. (2018), to upscale the samples. Thus, we introduce two wavelet-space super-resolution decoder networks that work directly with the wavelet-space image encoding while matching the performance of equivalent pixel-space methods. Our decoders are extremely compute-efficient to train (e.g., 3 days on the full ImageNet dataset) and, once trained on a diverse dataset like ImageNet, generalize beyond the original training resolution and dataset.
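To see what such a decoder must accomplish, consider the naive alternative: invert the Haar transform with the three dropped detail bands zero-filled. The sketch below (our own illustration, with the hypothetical helper name `ihaar2d`; not the paper's decoder) shows this inverse step; a learned wavelet-space decoder instead predicts the missing high-frequency content rather than zeroing it:

```python
import numpy as np

def ihaar2d(tl, tr, bl, br):
    """Inverse of one 2D Haar level: merges four half-resolution
    patches back into a full-resolution image."""
    s = np.sqrt(2.0)
    h, w = tl.shape
    lo = np.empty((2 * h, w))
    hi = np.empty((2 * h, w))
    # Undo the column-wise filter-and-downsample step.
    lo[::2, :] = (tl + bl) / s
    lo[1::2, :] = (tl - bl) / s
    hi[::2, :] = (tr + br) / s
    hi[1::2, :] = (tr - br) / s
    # Undo the row-wise step.
    img = np.empty((2 * h, 2 * w))
    img[:, ::2] = (lo + hi) / s
    img[:, 1::2] = (lo - hi) / s
    return img

# A wavelet-space sampler only produces TL; zero-filling TR/BL/BR
# before inverting yields a blocky 2x "upsampling" with no high
# frequencies -- the gap the learned decoder networks must close.
tl = np.random.rand(64, 64)
zeros = np.zeros_like(tl)
upscaled = ihaar2d(tl, zeros, zeros, zeros)  # 128 x 128
```

With all detail bands zero, every 2×2 output block equals `tl / 2`, i.e., nearest-neighbour upsampling up to scale, which is why a learned decoder is needed for sharp results.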

2. BACKGROUND AND RELATED WORK

Given a set X = {x_i | ∀i ∈ {1, . . . , N}, x_i ∈ R^D} of samples from the true data distribution p(x), the task of deep generative modeling is to approximate this distribution using deep neural networks. All generative models of natural images assume that p(x) is supported on a K-dimensional, smaller



Figure 1: The wavelet transformation consists of a low-pass and a high-pass filter, each followed by a down-sampling step, which splits the image into two equal-sized patches. Each of these two patches undergoes the same operation again, resulting in four equal-sized patches: TL, TR, BL and BR.
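The filter-then-downsample step in the caption can be made concrete on a 1D signal. The following is a minimal NumPy sketch using the Haar analysis filters (pairwise sum and difference), not code from the paper:

```python
import numpy as np

# One analysis step: low-pass (pairwise average) and high-pass
# (pairwise difference), each combined with keeping every second
# sample -- the filter + downsample boxes in Figure 1.
x = np.array([4.0, 6.0, 10.0, 12.0, 12.0, 10.0, 6.0, 4.0])
low = (x[::2] + x[1::2]) / np.sqrt(2)   # coarse approximation, length 4
high = (x[::2] - x[1::2]) / np.sqrt(2)  # detail coefficients, length 4
```

Applying both filters to an image's rows and then to the columns of each result produces the four patches TL, TR, BL and BR of the figure.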

