SEMANTICALLY-ADAPTIVE UPSAMPLING FOR LAYOUT-TO-IMAGE TRANSLATION

Abstract

We propose Semantically-Adaptive UpSampling (SA-UpSample), a general and highly effective upsampling method for the layout-to-image translation task. SA-UpSample has three advantages: 1) Global view. Unlike traditional upsampling methods (e.g., nearest-neighbor interpolation) that only exploit local neighborhoods, SA-UpSample can aggregate semantic information from a global view. 2) Semantically adaptive. Instead of using a fixed kernel for all locations (e.g., Deconvolution), SA-UpSample enables semantic class-specific upsampling by generating adaptive kernels for different locations. 3) Efficient. Unlike Spatial Attention, which uses a fully-connected strategy to connect all pixels, SA-UpSample only considers the most relevant pixels, introducing little computational overhead. We observe that SA-UpSample achieves consistent and substantial gains on six popular datasets. The source code will be made publicly available.

1. INTRODUCTION

The layout-to-image translation task aims to translate input layouts into realistic images (see Fig. 1(a)), which has many real-world applications and has drawn much attention from the community (Park et al., 2019; Liu et al., 2019; Jiang et al., 2020; Tang et al., 2020). For example, Park et al. (2019) propose GauGAN, which uses a novel spatially-adaptive normalization to generate realistic images from semantic layouts. Liu et al. (2019) propose CC-FPSE, which predicts convolutional kernels conditioned on the semantic layout and then generates the images. Tang et al. (2020) propose LGGAN with several local generators for generating realistic small objects. Despite the interesting exploration of these methods, we can still observe artifacts and blurriness in their generated images because they always adopt nearest-neighbor interpolation to upsample feature maps before generating the final results. Feature upsampling is a key operation in the layout-to-image translation task. Traditional upsampling methods such as nearest-neighbor, bilinear, and bicubic interpolation only consider a small sub-pixel neighborhood (indicated by white circles in Fig. 1(b)), failing to capture complete semantic information, e.g., the head and body of the dog, or the front part of the car. Learnable upsampling methods such as Deconvolution (Noh et al., 2015) and Pixel Shuffle (Shi et al., 2016) can obtain global information with larger kernel sizes, but they learn the same kernel (indicated by the white arrows in Fig. 1(c)) across the entire image, regardless of the semantic content. Other feature enhancement methods such as Spatial Attention (Fu et al., 2019) can learn different kernels (indicated by arrows of different colors in Fig. 1(d)), but they inevitably capture much redundant information, e.g., 'grass' and 'soil'. Moreover, Spatial Attention is prohibitively expensive since it needs to attend to all pixels.
To address these limitations, we propose a novel Semantically-Adaptive UpSampling (SA-UpSample) method for this challenging task (see Fig. 1(e)). SA-UpSample dynamically upsamples a small subset of relevant pixels based on the semantic information, i.e., the green and tangerine circles represent pixels within the dog and the car, respectively. In this way, SA-UpSample is more efficient than Deconvolution, Pixel Shuffle, and Spatial Attention, and captures more complete semantic information than traditional upsampling methods such as nearest-neighbor interpolation. We perform extensive experiments on six popular datasets with diverse scenarios and different image resolutions, i.e., Cityscapes (Cordts et al., 2016), ADE20K (Zhou et al., 2017), COCO-Stuff (Caesar et al., 2018), DeepFashion (Liu et al., 2016), CelebAMask-HQ (Lee et al., 2020), and Facades (Tyleček & Šára, 2013). We show that with the help of SA-UpSample, our framework synthesizes better results than several state-of-the-art methods. Moreover, an extensive ablation
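To make the class-specific aggregation idea concrete, the following is a minimal, illustrative sketch in pure Python. It is not the paper's learned operator: the function name `sa_upsample`, the window `radius`, and the uniform averaging over same-class neighbors (standing in for the learned, location-specific kernels) are all assumptions made for illustration only.

```python
def sa_upsample(feat, labels, scale=2, radius=1):
    """Upsample a 2D feature map `feat` by `scale`, aggregating only
    source pixels that share the semantic class (from `labels`) of the
    corresponding output location.

    `feat` and `labels` are equal-sized 2D lists; `labels` holds integer
    class ids (the input layout). This sketch uses uniform weights where
    the actual method would predict adaptive per-location kernels.
    """
    h, w = len(feat), len(feat[0])
    out = [[0.0] * (w * scale) for _ in range(h * scale)]
    for oy in range(h * scale):
        for ox in range(w * scale):
            sy, sx = oy // scale, ox // scale   # nearest source location
            cls = labels[sy][sx]                # semantic class at target
            total, count = 0.0, 0
            # aggregate only same-class pixels in a local window,
            # ignoring irrelevant classes (e.g., 'grass' around the dog)
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    ny, nx = sy + dy, sx + dx
                    if 0 <= ny < h and 0 <= nx < w and labels[ny][nx] == cls:
                        total += feat[ny][nx]
                        count += 1
            out[oy][ox] = total / count         # count >= 1 (center pixel)
    return out
```

For a toy 2x2 feature map split into two classes, each upsampled pixel averages only the features of its own class, so values never bleed across the semantic boundary, unlike bilinear interpolation, which would mix the two regions.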

