SEMANTIC IMAGE MANIPULATION WITH BACKGROUND-GUIDED INTERNAL LEARNING

Anonymous authors
Paper under double-blind review

Abstract

Image manipulation has attracted broad interest due to its wide range of applications. Prior work modifies images either at the pixel level, through image inpainting or manual edits via paintbrushes and scribbles, or at the semantic level, employing deep generative networks that output an image conditioned on high-level input. In this study, we propose Semantic Image Manipulation with Background-guided Internal Learning (SIMBIL), which combines high-level and pixel-level manipulation. Specifically, users edit an image at the semantic level by applying changes to its scene graph; our model then manipulates the image at the pixel level according to the modified scene graph. This approach has two major advantages. First, high-level manipulation requires less manual effort from the user than manipulating raw image pixels. Second, our pixel-level internal learning approach scales to images of various sizes without relying on external visual datasets for training. SIMBIL outperforms the state of the art in quantitative and qualitative evaluations on the CLEVR and Visual Genome datasets: experiments show roughly an 8-point improvement in SSIM (RoI) on CLEVR, and human users preferred our manipulated images over those of prior work by 9-33% on Visual Genome, demonstrating the effectiveness of our approach.

1. INTRODUCTION

Image manipulation modifies the content of an image according to user guidance. The task can be approached in two primary ways: pixel-level manipulation of raw images and high-level manipulation of image semantics. Pixel-level manipulation spans image inpainting (Zhao et al., 2019; Yeh et al., 2017), colorization (Zhang et al., 2016), object removal (Shetty et al., 2018), style transfer (Gatys et al., 2016), image extension (Teterwak et al., 2019), etc. Pixel-level manipulation methods do not need to understand the semantic meaning of an image. In contrast, high-level manipulation often uses deep generative networks conditioned on user inputs, such as semantic maps and language descriptions, to identify the desired modifications. Most prior work on high-level image manipulation is object-centric, such as human face transfer (Choi et al., 2018; Lee et al., 2020; Jo & Park, 2019; Zhao et al., 2018) and object appearance or attribute modification (Li et al., 2020a; Liang et al., 2018). Recently, approaches that modify entire scenes via instance maps (Wang et al., 2018), language descriptions (El-Nouby et al., 2019; Nichol et al., 2021; Avrahami et al., 2022), or scene graphs (Dhamo et al., 2020) have also been proposed. Although high-level manipulation requires less manual effort from users, deep generative networks for high-level manipulation have two drawbacks. First, such frameworks often only support low-resolution outputs due to GPU memory requirements (Dhamo et al., 2020); super-resolution modules (Saharia et al., 2022; Nichol et al., 2021) are then required to obtain higher-resolution images, introducing extra overhead. Second, generative models may lose attributes and details of the original image (Bau et al., 2020).
Ideally, a good image manipulation method should satisfy the following requirements: (1) provide maximum convenience to users; for example, manipulating images via scene graphs or language descriptions is more convenient than manually segmenting, replacing, or removing the target object; (2) preserve the textures and details of the original image in appropriate areas; (3) correctly modify the target region of the image according to user instructions; (4) generalize across input images without relying on specific external datasets. There are two major challenges to developing an approach that satisfies these requirements. First, it is difficult for existing text-driven image editing methods to accurately localize the Region of Interest (RoI), i.e., the region of the image that is supposed to be edited, in complex scenes. For example, popular frameworks including GLIDE (Nichol et al., 2021) and Blended Diffusion (Avrahami et al., 2022) require users to manually select the RoI. Methods that do not require bounding boxes as input (Li et al., 2020a;b) are mostly object-centric, and their images do not contain complex semantic relationships between objects. The ambiguity of text makes developing an RoI prediction model challenging: if there are multiple birds in an image, locating the target bird from a text description would be difficult even for a human. To address this, we use scene graph information to eliminate the ambiguity of text while still keeping manipulation easy. Second, most image inpainting methods (Nazeri et al., 2019; Yu et al., 2019; Rombach et al., 2022) train their models on a reconstruction task. As we show in Section 4.2 and Section 4.3, these external learning methods consequently tend to repair the target object even when the user command is to remove it. We introduce internal learning to avoid this object-repair issue. Specifically, we propose a Semantic Image Manipulation framework with Background-guided Internal Learning (SIMBIL).
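To make the disambiguation argument above concrete, the following minimal sketch shows how a relation triple in a scene graph pins down *which* object a user edit refers to, where free-form text ("the bird") would be ambiguous. All names here (the dictionary layout, `target_of_edit`, the `#1`/`#2` instance tags) are our own illustrative choices, not the paper's data format.

```python
# Hypothetical scene-graph representation: two birds that a plain text
# description could not tell apart, distinguished by their relations.
scene_graph = {
    "objects": ["bird#1", "bird#2", "branch", "rock"],
    "relations": [
        ("bird#1", "on", "branch"),
        ("bird#2", "on", "rock"),
    ],
}

def target_of_edit(graph, edit):
    """Resolve the object a user edit refers to via its relation triple."""
    subj, pred, obj = edit["triple"]
    for s, p, o in graph["relations"]:
        if (s, p, o) == (subj, pred, obj):
            return s if edit["role"] == "subject" else o
    return None  # triple not present in the graph

# "Remove the bird on the rock" expressed as an unambiguous graph edit:
edit = {"op": "remove", "triple": ("bird#2", "on", "rock"), "role": "subject"}
print(target_of_edit(scene_graph, edit))  # -> bird#2
```

Because the edit names an exact triple rather than a textual phrase, no separate grounding model is needed to decide between the two birds.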
SIMBIL combines high-level image semantics with pixel-level manipulation. Figure 1 illustrates the difference between SIMBIL and prior work using an object relationship change as an example. Figure 2 presents the overall structure of SIMBIL. First, the target object is determined from the scene graph of the image. Users can edit the nodes and edges of the scene graph to perform four operations: object removal, object replacement, semantic relationship change, and object addition. A segmentation module outlines the mask of the target object. A Recurrent Neural Network (RNN)-based module further encodes the semantic modifications between the objects and predicts the target Region of Interest (RoI) according to the editing operations. Finally, we improve Deep Image Prior (DIP) (Ulyanov et al., 2018) by using background pixels as a constraint, yielding our background-guided internal learning module. In summary, the contributions of this paper are:

• We propose a semantic image manipulation framework (SIMBIL) that combines high-level semantics with pixel-level image manipulation, reducing manual effort and alleviating the issues of prior work. Notably, compared to existing manipulation methods using scene graphs (Dhamo et al., 2020), SIMBIL generates higher-resolution images while accurately preserving the original details of the input images.

• We develop a background-guided internal learning algorithm based on DIP (Ulyanov et al., 2018) for image inpainting, which uses the average value of the background pixels around the missing region as guidance, rather than relying only on the implicit prior captured by the neural network parameterization, boosting performance.
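The background-guidance idea in the second contribution can be sketched as a simple two-term objective: a standard DIP reconstruction loss on the known pixels, plus a term pulling the RoI toward the mean of the background pixels surrounding it. This is a minimal NumPy sketch under our own assumptions; the function names, the square-structuring-element dilation used to find the surrounding ring, and the weight `lam` are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def local_background_mean(image, roi_mask, k=7):
    """Per-channel mean of background pixels in a ring around the RoI.
    image: (H, W, C) float array; roi_mask: (H, W), 1 inside the hole."""
    H, W = roi_mask.shape
    pad = k // 2
    padded = np.pad(roi_mask, pad, mode="constant")
    # Naive binary dilation with a k x k square structuring element.
    dilated = np.zeros_like(roi_mask)
    for dy in range(k):
        for dx in range(k):
            dilated = np.maximum(dilated, padded[dy:dy + H, dx:dx + W])
    ring = dilated * (1 - roi_mask)  # surrounding background pixels only
    denom = max(ring.sum(), 1)
    return (image * ring[..., None]).sum(axis=(0, 1)) / denom

def background_guided_loss(output, image, roi_mask, bg_mean, lam=0.1):
    """DIP-style reconstruction on known pixels + guidance toward the
    local background mean inside the hole (assumed weighting `lam`)."""
    keep = (1 - roi_mask)[..., None]
    hole = roi_mask[..., None]
    recon = np.mean((keep * (output - image)) ** 2)
    guide = np.mean((hole * (output - bg_mean)) ** 2)
    return recon + lam * guide
```

In an actual DIP loop, `output` would be the network prediction `f_theta(z)` re-evaluated each step, and this loss would replace the plain masked reconstruction objective; the guidance term discourages the network from hallucinating the removed object back into the hole.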






Figure 1: Prior work on image manipulation operates either at the pixel level (e.g., EdgeConnect (Nazeri et al., 2019)), shown in (a), or at the high level (e.g., ManiGAN (Li et al., 2020a)), shown in (b). In our work, shown in (c), we address the issues of prior work (see Section 1 for discussion) by connecting high-level semantics with pixel-level manipulation, where the semantic-level information is encoded by an RNN-based scene-graph encoder. The pixel-level manipulation, background-guided internal learning, is then performed according to the processed information.

