UNCONDITIONAL SYNTHESIS OF COMPLEX SCENES USING A SEMANTIC BOTTLENECK

Abstract

Coupling the high-fidelity generation capabilities of label-conditional image synthesis methods with the flexibility of unconditional generative models, we propose a semantic bottleneck GAN model for unconditional synthesis of complex scenes. We assume pixel-wise segmentation labels are available during training and use them to learn the scene structure through an unconditional progressive segmentation generation network. During inference, our model first synthesizes a realistic segmentation layout from scratch, then synthesizes a realistic scene conditioned on that layout through a conditional segmentation-to-image synthesis network. When trained end-to-end, the resulting model outperforms state-of-the-art generative models in unsupervised image synthesis on two challenging domains in terms of the Fréchet Inception Distance and perceptual evaluations. Moreover, we demonstrate that end-to-end training significantly improves the segmentation-to-image synthesis sub-network, which results in superior performance over the state of the art when conditioning on real segmentation layouts.

1. INTRODUCTION

Significant strides have been made on generative models for image synthesis, with a variety of methods based on Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) achieving state-of-the-art performance. At lower resolutions or in specialized domains, GAN-based methods are able to synthesize samples which are near-indistinguishable from real samples (Brock et al., 2019). However, generating complex, high-resolution scenes from scratch remains a challenging problem, as shown in Figure 1-(a) and (b). As image resolution and complexity increase, the coherence of synthesized images decreases: samples lack consistent local and global structure. Stochastic decoder-based models, such as conditional GANs, were recently proposed to alleviate some of these issues. In particular, both Pix2PixHD (Wang et al., 2018) and SPADE (Park et al., 2019) are able to synthesize high-quality scenes using a strong conditioning mechanism based on semantic segmentation labels during the scene generation process. The global structure encoded in the segmentation layout of the scene is what allows these models to focus primarily on generating convincing local content consistent with that structure. A key practical drawback of such conditional models is that they require full segmentation layouts as input. Thus, unlike unconditional generative approaches which synthesize images from randomly sampled noise, these models are limited to generating images from a set of scenes that is prescribed in advance, typically either through segmentation labels from an existing dataset, or scenes that are hand-crafted by experts.

Contributions

To overcome these limitations, we propose a new model, the Semantic Bottleneck GAN (SB-GAN), which couples the high-fidelity generation capabilities of label-conditional models with the flexibility of unconditional image generation. This in turn enables our model to synthesize an unlimited number of novel complex scenes, while still maintaining the high-fidelity output characteristic of image-conditional models. Our SB-GAN first unconditionally generates a pixel-wise semantic label map of a scene (i.e., for each spatial location it outputs a class label), and then generates a realistic scene image by conditioning on that semantic map, Figure 1-(d). By factorizing the task into these two steps, we are able to separately tackle the problems of producing convincing segmentation layouts (i.e., a useful global structure) and filling these layouts with convincing appearances (i.e., local structure). When trained end-to-end, the model yields samples which have a coherent global structure as well as fine local details, e.g., Figure 1-(c). Empirical evaluation shows that our Semantic Bottleneck GAN achieves a new state of the art on two complex datasets with a relatively small number of training images, Cityscapes and ADE-Indoor, as measured both by the Fréchet Inception Distance (FID) and by perceptual evaluations. Additionally, we observe that the conditional segmentation-to-image synthesis component of our SB-GAN, jointly trained with segmentation layout synthesis, significantly improves the state-of-the-art semantic image synthesis network (Park et al., 2019), resulting in higher-quality outputs when conditioning on ground truth segmentation layouts.
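To make the two-stage factorization concrete, the following is a minimal numpy sketch of the inference-time data flow: an unconditional network maps noise to per-pixel class logits, a discrete label map is read off, and a conditional network maps that layout to an RGB image. The stubbed generators, the class count, and the toy resolution are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_CLASSES = 19   # e.g., Cityscapes-style class count (assumption)
H, W = 32, 64      # toy resolution for illustration

def segmentation_generator(z):
    """Stand-in for the unconditional segmentation synthesis network.

    A real model maps the latent z to per-pixel class logits; this stub
    ignores z and emits random logits of the right shape to show the
    data flow only.
    """
    return rng.standard_normal((NUM_CLASSES, H, W))

def image_generator(label_map):
    """Stand-in for the conditional segmentation-to-image network
    (SPADE-style in the paper), mapping a one-hot layout to an image."""
    one_hot = np.eye(NUM_CLASSES)[label_map]   # (H, W, C)
    proj = rng.standard_normal((NUM_CLASSES, 3))
    return one_hot @ proj                      # (H, W, 3) "RGB" output

# Two-stage unconditional sampling:
z = rng.standard_normal(128)          # latent noise
logits = segmentation_generator(z)    # (C, H, W) per-pixel class logits
label_map = logits.argmax(axis=0)     # (H, W) discrete semantic layout
image = image_generator(label_map)    # synthesized scene conditioned on layout
```

The key point the sketch illustrates is that the layout acts as a bottleneck: the second network never sees the noise directly, only the discrete semantic map.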
Key Challenges While both unconditional generation and image-to-image translation are well-explored learning problems, fully unconditional generation of segmentation maps is a notoriously hard task: (i) Semantic categories do not respect any ordering relationships, and the network is therefore required to capture the intricate relationships between segmentation classes, their shapes, and their spatial dependencies. (ii) As opposed to RGB values, semantic categories are discrete and hence non-differentiable, which poses a challenge for end-to-end training (Sec. 3.2). (iii) Naively combining state-of-the-art unconditional generation and image-to-image translation models leads to poor performance. However, by carefully designing an additional discriminator component and a corresponding training protocol, we not only manage to improve the performance of the end-to-end model, but also the performance of each component separately (Sec. 3.3). We emphasize that despite these challenges our approach scales to 256 × 256 resolution and 95 semantic categories, whereas existing state-of-the-art GAN models directly generating RGB images at that resolution already suffer from considerable instability (Sec. 4).
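A common workaround for challenge (ii), the non-differentiability of discrete per-pixel class choices, is the straight-through Gumbel-softmax estimator; the sketch below is a generic numpy illustration of that technique under our own assumptions, not the exact mechanism of the paper's Sec. 3.2.

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits, tau=1.0, hard=True):
    """Sample a (near-)one-hot relaxation of a categorical distribution
    over the last axis of `logits`, via Gumbel perturbation."""
    # Gumbel(0, 1) noise; small epsilons guard against log(0).
    g = -np.log(-np.log(rng.uniform(size=logits.shape) + 1e-20) + 1e-20)
    y = (logits + g) / tau
    y = y - y.max(axis=-1, keepdims=True)          # numerical stability
    soft = np.exp(y)
    soft /= soft.sum(axis=-1, keepdims=True)       # differentiable softmax sample
    if not hard:
        return soft
    onehot = np.eye(logits.shape[-1])[soft.argmax(axis=-1)]
    # Straight-through trick in an autodiff framework would be:
    #   out = onehot - soft.detach() + soft
    # so the forward pass is discrete while gradients follow `soft`.
    return onehot

# Per-pixel logits for a toy 4x4 label map over 5 classes:
logits = rng.standard_normal((4, 4, 5))
sample = gumbel_softmax(logits)                    # hard one-hot per pixel
```

Lowering the temperature `tau` makes the soft sample approach the hard one-hot, at the cost of higher-variance gradients.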

2. RELATED WORK

Generative Adversarial Networks (GANs) GANs (Goodfellow et al., 2014) are a powerful class of generative models successfully applied to various image synthesis tasks such as image style transfer (Isola et al., 2017; Zhu et al., 2017), unsupervised representation learning (Chen et al., 2016; Pathak et al., 2016; Radford et al., 2016), image super-resolution (Ledig et al., 2017; Dong et al., 2016), and text-to-image synthesis (Zhang et al., 2017; Xu et al., 2018; Qiao et al., 2019b). Training GANs is notoriously hard, and recent efforts focused on improving neural architectures (Wang & Gupta, 2016; Karras et al., 2017; Zhang et al., 2019; Chen et al., 2019a), loss functions (Arjovsky et al., 2017), regularization (Gulrajani et al., 2017; Miyato et al., 2018), large-scale training (Brock



Figure 1: (a) Examples of non-complex images from ImageNet synthesized by the state-of-the-art BigGAN model (Brock et al., 2019). Although these samples look decent, the complex scenes synthesized by BigGAN (e.g., from the Cityscapes dataset) are blurry and defective in local structure (e.g., cars are blended together) (b). Zoom in for more detail. (c) A complex scene synthesized by our model respects both local and global structural integrity of the scene. (d) Schematic of our unconditional Semantic Bottleneck GAN. We progressively train the adversarial segmentation synthesis network to generate realistic segmentation maps from scratch, then synthesize a photorealistic image using a conditional image synthesis network. End-to-end coupling of these two components results in state-of-the-art unconditional synthesis of complex scenes.

