LARGE SCALE IMAGE COMPLETION VIA CO-MODULATED GENERATIVE ADVERSARIAL NETWORKS

Abstract

Numerous task-specific variants of conditional generative adversarial networks have been developed for image completion. Yet a serious limitation remains: all existing algorithms tend to fail when handling large-scale missing regions. To overcome this challenge, we propose a generic new approach that bridges the gap between image-conditional and recent modulated unconditional generative architectures via co-modulation of both conditional and stochastic style representations. Also, motivated by the lack of good quantitative metrics for image completion, we propose the new Paired/Unpaired Inception Discriminative Score (P-IDS/U-IDS), which robustly measures the perceptual fidelity of inpainted images relative to real images via linear separability in a feature space. Experiments demonstrate superior performance, in terms of both quality and diversity, over state-of-the-art methods in free-form image completion, as well as easy generalization to image-to-image translation.
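The idea behind measuring fidelity via "linear separability in a feature space" can be sketched as follows: extract features for real and completed images, fit a linear classifier to separate the two sets, and report how poorly it separates them (harder to separate implies higher fidelity). The paper uses Inception features and a linear classifier; the nearest-mean decision rule and synthetic data below are simplified, illustrative stand-ins, not the paper's actual metric implementation.

```python
import numpy as np

# Synthetic stand-ins for Inception features of real vs. completed images.
rng = np.random.default_rng(0)
feat_real = rng.standard_normal((1000, 64)) + 0.5
feat_fake = rng.standard_normal((1000, 64)) - 0.5

def separability_sketch(real, fake):
    """Illustrative linear-separability score: symmetric misclassification
    rate of a nearest-mean linear classifier (0 = perfectly separable,
    0.5 = indistinguishable)."""
    w = real.mean(0) - fake.mean(0)                # linear direction
    b = -0.5 * (real.mean(0) + fake.mean(0)) @ w   # midpoint threshold
    err_real = np.mean(real @ w + b < 0)           # reals classified as fake
    err_fake = np.mean(fake @ w + b > 0)           # fakes classified as real
    return 0.5 * (err_real + err_fake)

score = separability_sketch(feat_real, feat_fake)
```

Under this reading, a higher misclassification rate indicates that completed images are harder to tell apart from real ones in feature space.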

1. INTRODUCTION

Generative adversarial networks (GANs) have received a great deal of attention in the past few years, during which a fundamental problem has emerged from the diverging development of image-conditional and unconditional GANs. Image-conditional GANs have a wide variety of computer vision applications (Isola et al., 2017). As vanilla U-Net-like generators cannot achieve promising performance, especially in free-form image completion (Liu et al., 2018; Yu et al., 2019), a multiplicity of task-specific approaches have been proposed to specialize GAN frameworks, mostly focused on hand-engineered multi-stage architectures, specialized operations, or intermediate structures such as edges or contours (Altinel et al., 2018; Ding et al., 2018; Iizuka et al., 2017; Jiang et al., 2019; Lahiri et al., 2020; Li et al., 2020; Liu et al., 2018; 2019a; 2020; Nazeri et al., 2019; Ren et al., 2019; Wang et al., 2018; Xie et al., 2019; Xiong et al., 2019; Yan et al., 2018; Yu et al., 2018; 2019; Zeng et al., 2019; Zhao et al., 2020a; Zhou et al., 2020). These lines of work have made significant progress in reducing generation artifacts such as color discrepancy and blurriness. However, a serious challenge remains: all existing algorithms tend to fail when handling large-scale missing regions. This is mainly due to their lack of underlying generative capability: one can never learn to complete a large proportion of an object without the capability of generating a completely new one. We argue that the key to overcoming this challenge is to bridge the gap between image-conditional and unconditional generative architectures. Recently, the performance of unconditional GANs has been fundamentally advanced, chiefly owing to the success of modulation approaches (Chen et al., 2019; Karras et al., 2019a; b) with learned style representations produced by a latent vector.
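The modulation approach referred to here (e.g. StyleGAN2; Karras et al., 2019b) rescales convolution weights per input channel by a learned style vector derived from the latent, followed by demodulation. The sketch below illustrates that mechanism only; the function name, shapes, and the omission of the learned mapping/affine layers are simplifying assumptions, not the cited papers' exact implementation.

```python
import numpy as np

def modulate_weights(w, s, eps=1e-8):
    """Illustrative StyleGAN2-style weight (de)modulation.
    w: conv weights of shape (out_ch, in_ch, k, k); s: style vector (in_ch,)."""
    w_mod = w * s[None, :, None, None]                      # per-input-channel scaling by the style
    demod = 1.0 / np.sqrt((w_mod ** 2).sum(axis=(1, 2, 3)) + eps)
    return w_mod * demod[:, None, None, None]               # renormalize each output filter

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 4, 3, 3))   # toy convolution weights
s = rng.standard_normal(4)              # toy style vector (would come from a mapping network)
w_prime = modulate_weights(w, s)
```

After demodulation, each output filter has approximately unit norm, so the style steers the relative emphasis of input channels without destabilizing activation magnitudes.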
Researchers have also extended modulation approaches to image-conditional GANs, with the style representations fully determined by an input image (Park et al., 2019; Huang et al., 2018; Liu et al., 2019b); however, the absence of stochasticity makes them hardly generalizable to settings where only limited conditional information is available. This limitation is fatal especially in large-scale image completion. Although some multi-modal unpaired image-to-image translation methods propose to encode the style from another reference image (Huang et al., 2018; Liu et al., 2019b), this unreasonably assumes that the style representations are entirely independent of the conditional input and hence compromises consistency.

Therefore, we propose co-modulated generative adversarial networks, a generic approach that leverages the generative capability of unconditional modulated architectures, embedding both conditional and stochastic style representations via co-modulation. Co-modulated GANs are thus able to generate diverse and consistent contents and generalize well not only to small-scale inpainting but also to extremely large-scale image completion, supporting both regular and irregular masks even when only little conditional information is available. See Fig. 1 for qualitative examples. Owing to the effectiveness of co-modulation, we avoid the problems suffered in the image completion literature (Liu et al., 2018; Yu et al., 2019), successfully bridging the long-existing divergence.

Another major barrier in the image completion literature is the lack of good quantitative metrics. The vast majority of works in this literature seek to improve their performance in terms of similarity-based
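The co-modulation idea described above can be sketched as a style vector produced jointly from a conditional branch (an encoding of the masked input image) and a stochastic branch (a mapped latent vector). Everything below is an illustrative assumption: the stand-in `encode` and `mapping` functions, the joint affine `A`, and all dimensions are hypothetical placeholders for the paper's learned networks.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # embedding width (illustrative)

def encode(y):
    """Stand-in for a learned image encoder producing a conditional embedding."""
    return np.tanh(y @ rng.standard_normal((y.shape[-1], D)))

def mapping(z):
    """Stand-in for a learned mapping network producing a stochastic embedding."""
    return np.tanh(z @ rng.standard_normal((z.shape[-1], D)))

A = rng.standard_normal((2 * D, D))  # stand-in for a learned joint affine

def co_modulated_style(y, z):
    """Style s = A([E(y); M(z)]): conditional and stochastic embeddings
    jointly determine the modulation, unlike purely image-conditional styles."""
    return np.concatenate([encode(y), mapping(z)], axis=-1) @ A

y = rng.standard_normal(32)   # flattened conditional input (e.g. masked image)
s1 = co_modulated_style(y, rng.standard_normal(8))
s2 = co_modulated_style(y, rng.standard_normal(8))
```

Different latents yield different style vectors for the same conditional input, which is where the diversity of completions comes from, while the conditional embedding keeps completions consistent with the visible region.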



Figure 1: Our image completion results w.r.t. different masks. Our method successfully bridges differently conditioned situations, from small-scale inpainting to large-scale completion (left to right). The original images are sampled at 512×512 resolution from the FFHQ dataset (Karras et al., 2019a) within a 10k validation split (top two examples) and the Places2 validation set (Zhou et al., 2017) (bottom two examples). We refer the readers to the appendix for extensive qualitative examples.

