CUT-AND-PASTE NEURAL RENDERING

Abstract

Figure 1 (columns: Target Scene, Cut-and-Paste, Cong et al. (2020), Ours): We generate realistic renderings of cut-and-paste images. Our method is entirely image-based and can convincingly reshade/relight fragments with complex surface properties (a Lego dozer, a plant, and a chair in the top row) and matte, glossy, and specular fragments (a set of 16 different materials in the bottom row) added to target scenes with spatially varying illumination (indoor/outdoor and day/night), without requiring the geometry of the inserted fragment or the parameters of the target scene.



We describe an alternative procedure, cut-and-paste neural rendering, which renders the inserted fragment's shading field consistent with the target scene. We use a Deep Image Prior (DIP) as a neural renderer trained to render an image with consistent image decomposition inferences. The rendering produced by the DIP should have an albedo consistent with the cut-and-paste albedo; it should have a shading field that, outside the inserted fragment, matches the target scene's shading field; and the cut-and-paste surface normals should be consistent with the final rendering's shading field. The result is a simple procedure that produces convincing and realistic shading. Moreover, our procedure requires neither rendered images, image decompositions of real images, nor labeled annotations for training. In fact, our only use of simulated ground truth is a pre-trained normal estimator. Qualitative results are strong, and are supported by a user study comparing our method against a state-of-the-art image harmonization baseline.
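The three consistency constraints above can be sketched as masked reconstruction losses. The sketch below is illustrative only: all function and variable names are our own, the DIP itself (a CNN optimized per image) is not shown, and the inputs stand in for the albedo/shading/normal inferences the text describes.

```python
import numpy as np

def masked_mse(a, b, mask):
    """Mean squared error over pixels where mask == 1."""
    d = (a - b) ** 2 * mask
    return d.sum() / np.maximum(mask.sum(), 1e-8)

def consistency_losses(render_albedo, render_shading,
                       cutpaste_albedo, target_shading,
                       shading_from_normals, frag_mask):
    """Three terms described in the text (names are illustrative):
      1) the rendering's albedo matches the cut-and-paste albedo;
      2) outside the fragment, the rendering's shading matches the
         target scene's shading field;
      3) inside the fragment, the shading agrees with the shading
         implied by the cut-and-paste surface normals."""
    l_albedo = masked_mse(render_albedo, cutpaste_albedo,
                          np.ones_like(frag_mask))
    l_shading = masked_mse(render_shading, target_shading,
                           1.0 - frag_mask)
    l_normals = masked_mse(render_shading, shading_from_normals,
                           frag_mask)
    return l_albedo + l_shading + l_normals
```

In the full method these terms would drive the optimization of the DIP's weights; here they are shown only as plain array computations.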

1. INTRODUCTION

Cut-and-paste rendering involves creating a new image by cutting fragments out of one or more source images and pasting them into a target image; the idea originates with Lalonde et al. (2007). Results are often unrealistic because of the difference in illumination between the source and target images. But the procedure is useful to artists, and there is consistent evidence that such procedures can be used to train detectors (Liao et al., 2012; Dwibedi et al., 2017). When the geometry and material of the inserted object are known, it is enough to infer an illumination model from the target, render, and composite. But current procedures for recovering shape and material from a single fragment simply cannot deal with most realistic fragments (think of, say, a furry cat). This paper describes an alternative method, Cut-and-Paste Neural Rendering, that can render convincing composite images by adjusting the cut-and-paste image so that simple image inferences are consistent with cut-and-paste predictions: the albedo of the adjusted image should look like the cut-and-paste albedo; the shading should look like a shading field; and the image should look like an image. A simple post-processing trick produces very high-resolution composites. Note that all our rendered images are 1024×1024 pixels and are best viewed on screen. Evaluation is mostly qualitative, but we show that our method fools a recent method for detecting tampering. Our contributions are: a method that can realistically correct shading in composite images without requiring labeled data; a method that works for matte, glossy, and specular fragments without an explicit geometric or physical model; and evidence that human subjects prefer the results of our method over cut-and-paste and image harmonization.
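The baseline cut-and-paste operation itself is a straightforward masked composite. A minimal sketch, with our own function and parameter names, assuming the fragment has already been aligned to the target frame:

```python
import numpy as np

def cut_and_paste(target, fragment, mask):
    """Naive cut-and-paste composite: copy the fragment into the
    target wherever the (possibly soft) mask is nonzero.
    target, fragment: float arrays of shape (H, W, 3) in [0, 1];
    mask: float array of shape (H, W, 1) in [0, 1]."""
    return mask * fragment + (1.0 - mask) * target
```

This composite ignores illumination entirely, which is exactly the mismatch the method described in the paper sets out to correct.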
Poisson blending (Pérez et al., 2003; Jia et al., 2006) can resolve nasty boundary artifacts, but significant illumination and color mismatches cause cross-talk between target and fragment, producing ugly results. Karsch et al. (2011) show that computer graphics (CG) objects can be convincingly inserted into inverse rendering models obtained with a geometric inference or with single-image depth reconstruction (Karsch et al., 2014). Inverse rendering trained with rendered images can produce excellent reshading of CG objects (Ramachandran, 1988). However, recovering a renderable model from an image fragment is extremely difficult, particularly if the fragment has an odd surface texture. Liao et al. showed that a weak geometric model of the fragment can be sufficient to correct shading if one has strong geometric information about the target scene (Liao et al., 2015; 2019). In contrast, our work is entirely image-based: one takes a fragment from one image, drops it into another, and expects a system to correct it.
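For concreteness, Poisson blending solves a Laplace equation inside the pasted region, matching the source's gradients while taking boundary values from the target. A minimal single-channel sketch using Jacobi iteration (our own naming; production code would use a sparse linear solver, and the mask is assumed to avoid the image border):

```python
import numpy as np

def poisson_blend(target, source, mask, iters=500):
    """Gradient-domain compositing: solve lap(result) = lap(source)
    inside the mask, with target values as boundary conditions,
    via Jacobi iteration. target, source: float arrays (H, W);
    mask: binary array (H, W), zero on the image border."""
    result = target.copy()
    # Discrete Laplacian of the source (4-neighbor stencil).
    lap = (np.roll(source, 1, 0) + np.roll(source, -1, 0) +
           np.roll(source, 1, 1) + np.roll(source, -1, 1) - 4.0 * source)
    inside = mask > 0
    for _ in range(iters):
        avg = (np.roll(result, 1, 0) + np.roll(result, -1, 0) +
               np.roll(result, 1, 1) + np.roll(result, -1, 1))
        # Jacobi update: neighbors' average minus the source Laplacian.
        result[inside] = (avg - lap)[inside] / 4.0
    return result
```

The cross-talk the text mentions follows directly from this formulation: when source and target differ strongly in color or brightness, the boundary condition drags the fragment's interior toward the target's colors.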

2. RELATED WORK

We use image harmonization (IH) methods as a strong baseline. These procedures aim to correct corrupted images: IH methods are trained to correct images in which a fragment has been adjusted by some noise process (made brighter, recolored, etc.) relative to the original image (Sunkavalli et al., 2010; Tsai et al., 2017; Cong et al., 2020), and so could clearly be applied here. But we find that image harmonization methods very often change the albedo of an inserted object rather than its shading, because they rely on ensuring consistency of color representations across the image. For example, on the iHarmony dataset of Cong et al. (2020), they change pink candy to brown (an albedo change; see Fig. 12 in the Appendix). In contrast, we wish to correct shading alone.

Image Relighting. With appropriate training data, for indoor scenes, one can predict multiple spherical harmonic components of illumination (Garon et al., 2019), a parametric lighting model (Gardner et al., 2019), or even full radiance maps at scene points (Song & Funkhouser, 2019; Srinivasan et al., 2020). For outdoor scenes, the sun's position can be predicted in panoramas using a learning-based approach (Hold-Geoffroy et al., 2019). One can also construct a volumetric radiance field from multi-view data to synthesize novel views (Mildenhall et al., 2020). However, we have access neither to training data with lighting parameters/environment maps nor to multi-view data for constructing such a radiance field; our renderings are entirely image-based. Recent single-image relighting methods relight portrait faces under directional lighting (Sun et al., 2019; Zhou et al., 2019; Nestmeyer et al., 2020). Our method can relight matte, glossy and specular objects with



Insertion starts with Lalonde et al. (2007), who insert fragments into target images. Lalonde et al. (2007) control illumination problems by checking fragments for compatibility with targets; Bansal et al. (2019) do so by matching contexts.

