SEMANTICALLY-ADAPTIVE UPSAMPLING FOR LAYOUT-TO-IMAGE TRANSLATION

Abstract

We propose Semantically-Adaptive UpSampling (SA-UpSample), a general and highly effective upsampling method for the layout-to-image translation task. SA-UpSample has three advantages: 1) Global view. Unlike traditional upsampling methods (e.g., nearest-neighbor) that only exploit local neighborhoods, SA-UpSample can aggregate semantic information in a global view. 2) Semantically adaptive. Instead of using a fixed kernel for all locations (e.g., Deconvolution), SA-UpSample enables semantic class-specific upsampling by generating adaptive kernels for different locations. 3) Efficient. Unlike Spatial Attention, which uses a fully-connected strategy to connect all the pixels, SA-UpSample only considers the most relevant pixels, introducing little computational overhead. We observe that SA-UpSample achieves consistent and substantial gains on six popular datasets. The source code will be made publicly available.

1. INTRODUCTION

The layout-to-image translation task aims to translate input layouts to realistic images (see Fig. 1(a)), which has many real-world applications and draws much attention from the community (Park et al., 2019; Liu et al., 2019; Jiang et al., 2020; Tang et al., 2020). For example, Park et al. (2019) propose GauGAN with a novel spatially-adaptive normalization to generate realistic images from semantic layouts. Liu et al. (2019) propose CC-FPSE, which predicts convolutional kernels conditioned on the semantic layout and then generates the images. Tang et al. (2020) propose LGGAN with several local generators for generating realistic small objects. Despite the interesting exploration of these methods, we can still observe artifacts and blurriness in their generated images because they always adopt nearest-neighbor interpolation to upsample feature maps and then generate the final results. Feature upsampling is a key operation in the layout-to-image translation task. Traditional upsampling methods such as nearest-neighbor, bilinear, and bicubic only consider a sub-pixel neighborhood (indicated by white circles in Fig. 1(b)), failing to capture the complete semantic information, e.g., the head and body of the dog, and the front part of the car. Learnable upsampling methods such as Deconvolution (Noh et al., 2015) and Pixel Shuffle (Shi et al., 2016) are able to obtain global information with larger kernel sizes, but learn the same kernel (indicated by the white arrows in Fig. 1(c)) across the image, regardless of the semantic information. Other feature enhancement methods such as Spatial Attention (Fu et al., 2019) can learn different kernels (indicated by arrows of different colors in Fig. 1(d)), but still inevitably capture a lot of redundant information, e.g., 'grasses' and 'soil'. Spatial Attention is also prohibitively expensive since it needs to consider all the pixels.
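The limitation of nearest-neighbor interpolation discussed above can be made concrete with a small sketch (illustrative only, not any paper's released code): each output pixel is a copy of exactly one source pixel, so the upsampling rule is fixed, purely local, and blind to semantics.

```python
import numpy as np

def nearest_upsample(f, s):
    """Nearest-neighbor upsampling of a (C, H, W) feature map by
    integer scale s: every output location (i', j') copies the single
    source pixel (i'//s, j'//s), a fixed rule with no semantic
    awareness and a sub-pixel receptive field."""
    C, H, W = f.shape
    out = np.empty((C, H * s, W * s), dtype=f.dtype)
    for i in range(H * s):
        for j in range(W * s):
            out[:, i, j] = f[:, i // s, j // s]
    return out

f = np.arange(4, dtype=float).reshape(1, 2, 2)
up = nearest_upsample(f, 2)
# each s x s output block simply repeats one input value
```

Because the kernel is the same everywhere and touches only one source pixel, no amount of training can make it respect object boundaries, which motivates the adaptive design below.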
To address these limitations, we propose a novel Semantically-Adaptive UpSampling (SA-UpSample) for this challenging task (see Fig. 1(e)). Our SA-UpSample dynamically upsamples a small subset of relevant pixels based on the semantic information, i.e., the green and the tangerine circles represent pixels within the dog and the car, respectively. In this way, SA-UpSample is more efficient than Deconvolution, Pixel Shuffle, and Spatial Attention, and can capture more complete semantic information than traditional upsampling methods such as nearest-neighbor interpolation. We perform extensive experiments on six popular datasets with diverse scenarios and different image resolutions, i.e., Cityscapes (Cordts et al., 2016), ADE20K (Zhou et al., 2017), COCO-Stuff (Caesar et al., 2018), DeepFashion (Liu et al., 2016), CelebAMask-HQ (Lee et al., 2020), and Facades (Tyleček & Šára, 2013). We show that with the help of SA-UpSample, our framework can synthesize better results compared to several state-of-the-art methods. Moreover, an extensive ablation study validates the effectiveness of each component of the proposed SA-UpSample.

2. RELATED WORK

Feature Upsampling. Traditional upsampling methods such as nearest-neighbor and bilinear interpolation use spatial distance and hand-crafted kernels to capture the correlations between pixels. Recently, several deep learning methods such as Deconvolution (Noh et al., 2015) and Pixel Shuffle (Shi et al., 2016) have been proposed to upsample feature maps using learnable kernels. However, these methods either exploit semantic information only in a small neighborhood or use a fixed kernel. Other works on super-resolution, inpainting, and denoising (Mildenhall et al., 2018; Wang et al., 2019; Jo et al., 2018; Hu et al., 2019) also explore learnable kernels. However, the settings of these tasks differ significantly from ours, so their methods cannot be applied directly. Layout-to-Image Translation tries to convert semantic layouts into realistic images (Park et al., 2019; Liu et al., 2019; Jiang et al., 2020; Tang et al., 2020; Zhu et al., 2020a; Ntavelis et al., 2020; Zhu et al., 2020b). Although existing methods generate good images, we still observe unsatisfactory results, mainly in content details and intra-object completeness, which we believe is largely because these methods always adopt nearest-neighbor interpolation to upsample feature maps and then generate the final results. To fix this limitation, we propose a novel Semantically-Adaptive UpSampling (SA-UpSample) for this task. To the best of our knowledge, we are the first to investigate the influence of feature upsampling on this challenging task.

3. SEMANTICALLY-ADAPTIVE UPSAMPLING (SA-UPSAMPLE)

An illustration of the proposed Semantically-Adaptive UpSampling (SA-UpSample) is shown in Fig. 2. It mainly consists of two branches, i.e., the Semantically-Adaptive Kernel Generation (SAKG) branch, which predicts upsampling kernels according to the semantic information, and the Semantically-Adaptive Feature Upsampling (SAFU) branch, which selectively performs feature upsampling based on the kernels learned in SAKG. All components are trained in an end-to-end fashion so that the two branches can benefit from each other. Specifically, given a feature map f ∈ R^(C×H×W) and an upsampling scale s, SA-UpSample aims to produce a new feature map f′ ∈ R^(C×sH×sW). For any target location l′ = (i′, j′) in the output f′, there is a corresponding source location l = (i, j) in the input f, where i = ⌊i′/s⌋ and j = ⌊j′/s⌋. We denote by N(l, k) the k×k sub-region of f centered at location l, i.e., the neighborhood of location l. See Figs. 1 and 2 for illustration.
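To make the notation concrete, the SAFU step can be sketched as follows. This is an illustrative reading of the definitions above, not the released implementation: each target location l′ = (i′, j′) maps to the source location l = (⌊i′/s⌋, ⌊j′/s⌋) and gathers a weighted sum over N(l, k), with the per-location k×k kernels assumed to come from the SAKG branch.

```python
import numpy as np

def sa_upsample(f, kernels, s, k):
    """Sketch of the SAFU step (assumed form). f: (C, H, W) feature
    map; kernels: (s*H, s*W, k, k) per-location upsampling kernels,
    as produced by the SAKG branch; s: upsample scale; k: kernel size.
    Each output location is a kernel-weighted sum over N(l, k)."""
    C, H, W = f.shape
    r = k // 2
    # edge padding so N(l, k) is defined at the border
    pad = np.pad(f, ((0, 0), (r, r), (r, r)), mode="edge")
    out = np.zeros((C, s * H, s * W), dtype=f.dtype)
    for ip in range(s * H):
        for jp in range(s * W):
            i, j = ip // s, jp // s            # source location l = (i'//s, j'//s)
            patch = pad[:, i:i + k, j:j + k]   # neighborhood N(l, k)
            out[:, ip, jp] = (patch * kernels[ip, jp]).sum(axis=(1, 2))
    return out
```

With uniform kernels whose weights sum to one this reduces to box filtering; the semantic adaptivity comes entirely from how SAKG predicts a different kernel at each output location.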

3.1. SEMANTICALLY-ADAPTIVE KERNEL GENERATION (SAKG) BRANCH

This branch aims to generate a semantically-adaptive kernel at each location according to the semantic information, which consists of four modules, i.e., Feature Channel Compression, Semantic Kernel Generation, Feature Shuffle, and Channel-wise Normalization.
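The last two modules can be sketched as follows (an illustration under assumed tensor shapes, not the authors' implementation). Here `raw` stands for the output of the first two modules, i.e., a map with s·s·k·k channels predicted at each of the H×W compressed-feature locations; Feature Shuffle rearranges the s·s groups onto the upsampled sH×sW grid in pixel-shuffle style, and Channel-wise Normalization applies a softmax over each k·k kernel so its weights sum to one.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def shuffle_and_normalize(raw, s, k):
    """Sketch of Feature Shuffle + Channel-wise Normalization.
    raw: (s*s*k*k, H, W) kernel predictions; the (s*s, k*k) channel
    grouping is an assumption. Returns (k*k, s*H, s*W): one
    normalized k x k kernel per upsampled location."""
    _, H, W = raw.shape
    x = raw.reshape(s, s, k * k, H, W)        # split the s*s sub-position groups
    x = x.transpose(2, 3, 0, 4, 1)            # -> (k*k, H, s, W, s)
    x = x.reshape(k * k, s * H, s * W)        # interleave onto the upsampled grid
    return softmax(x, axis=0)                 # each location's kernel sums to 1
```

Feature Channel Compression (a 1×1 convolution reducing C channels before kernel prediction) and Semantic Kernel Generation (a convolution producing the s·s·k·k kernel values) are omitted here since they are standard learned layers; the normalization guarantees that upsampling is a convex combination of the neighborhood N(l, k).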



Figure 1: Comparison with different feature upsampling and enhancement methods on the layout-to-image translation task. Given two target locations l′ (indicated by red and magenta squares) in the output feature map f′, our goal is to generate these locations by selectively upsampling several points N(l, k) (indicated by circles) in the input feature map f.

