SEMANTIC IMAGE MANIPULATION WITH BACKGROUND-GUIDED INTERNAL LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Image manipulation has attracted a lot of interest due to its wide range of applications. Prior work modifies images either through pixel-level manipulation, such as image inpainting or manual edits via paintbrushes and scribbles, or through high-level manipulation, employing deep generative networks to output an image conditioned on high-level semantic input. In this study, we propose Semantic Image Manipulation with Background-guided Internal Learning (SIMBIL), which combines high-level and pixel-level manipulation. Specifically, users edit an image at the semantic level by applying changes to its scene graph; our model then manipulates the image at the pixel level according to the modified scene graph. This approach has two major advantages. First, high-level manipulation requires less manual effort from the user than manipulating raw image pixels. Second, our pixel-level internal learning approach scales to images of various sizes without relying on external visual datasets for training. We outperform the state of the art in quantitative and qualitative evaluations on the CLEVR and Visual Genome datasets. Experiments show an improvement of around 8 points in SSIM (RoI) on CLEVR, and human users preferred our manipulated images over those of prior work by 9-33% on Visual Genome, demonstrating the effectiveness of our approach.

1. INTRODUCTION

Image manipulation modifies the content of an image according to user guidance. The task can be approached in two primary ways: pixel-level manipulation of raw images and high-level manipulation of image semantics. Pixel-level manipulation spans image inpainting (Zhao et al., 2019; Yeh et al., 2017), colorization (Zhang et al., 2016), object removal (Shetty et al., 2018), style transfer (Gatys et al., 2016), image extension (Teterwak et al., 2019), etc. Pixel-level manipulation methods do not need to understand the semantic meaning of an image. In contrast, high-level manipulation often uses deep generative networks conditioned on user inputs such as semantic maps and language descriptions to identify the desired modifications. Most prior work on high-level image manipulation is object-centric, such as human face transfer (Choi et al., 2018; Lee et al., 2020; Jo & Park, 2019; Zhao et al., 2018) and object appearance or attribute modification (Li et al., 2020a; Liang et al., 2018). Recently, approaches that modify entire scenes via instance maps (Wang et al., 2018), language descriptions (El-Nouby et al., 2019; Nichol et al., 2021; Avrahami et al., 2022), or scene graphs (Dhamo et al., 2020) have also been proposed. Although high-level manipulation requires less manual effort from users, deep generative networks for high-level manipulation have two drawbacks. First, high-level manipulation frameworks often support only low-resolution outputs due to GPU memory requirements (Dhamo et al., 2020); super-resolution modules (Saharia et al., 2022; Nichol et al., 2021) are required to obtain higher-resolution images, introducing extra overhead. Second, generative models may lose attributes and details of the original images (Bau et al., 2020).
Ideally, a good image manipulation method should satisfy the following requirements: (1) provide maximum convenience to users; for example, manipulating images via scene graphs or language descriptions is more convenient than manually segmenting, replacing, or removing the target object; (2) preserve the textures and details of the original image in appropriate areas; (3) correctly modify the target region of the image according to user instructions; (4) generalize across input images without relying on specific external datasets. There are two major challenges to developing an approach that satisfies these requirements. First, it is challenging for existing text-driven

