INSTANCE-AWARE IMAGE COMPLETION

Abstract

Image completion is a task that aims to fill in the missing region of a masked image with plausible contents. However, existing image completion methods tend to fill in the missing region with the surrounding texture instead of hallucinating a visual instance that suits the context of the scene. In this work, we propose a novel image completion model, dubbed Refill, that hallucinates the missing instance so that it harmonizes with, and thus preserves, the original context. Refill first adopts a transformer architecture that considers the types and locations of the visible instances as well as the location of the missing region. Then, Refill completes the missing foreground and background semantic segmentation masks within the missing region, providing pixel-level semantic and structural guidance to generate missing contents with seamless boundaries. Finally, we condition the image synthesis blocks of Refill on the completed segmentation mask to generate photo-realistic contents that fill in the missing region. Experimental results show the superiority of Refill over state-of-the-art image completion approaches on various natural images.

1. INTRODUCTION

Image completion is the task of restoring the masked regions of an image, which requires an understanding of the unmasked instances and the various relationships among them. Researchers have worked to develop image completion models for practical applications such as image editing (Jo & Park, 2019; Ling et al., 2021), restoration (Wan et al., 2020; Liang et al., 2021), and object removal (Shetty et al., 2018). Most previous models, however, focus on filling in the missing region realistically without considering the instance that needs to be restored. For example, we observe that even a cutting-edge image inpainting model (Li et al., 2022) tends to complete the missing region with surrounding textures rather than attempting to restore the lost instance; this limits the usage of image completion models in real-world applications.

Removal of a focal instance in a scene can lead to a substantial context change. For example, removing the horse in the image of Figure 1 changes the local context around the missing region from "a person riding a horse on the beach" to "a boy walking on the beach". HVITA (Qiu et al., 2020) is the only work that tackles such substantial context change, which arises from the complete removal of a visual instance from the scene. However, HVITA has three major limitations: (1) it mainly targets rectangular masks and thus generalizes poorly to other mask shapes, (2) its completed images exhibit abrupt changes along the boundaries between the generated and original regions, and (3) it relies heavily on a refinement network to produce realistic images. To alleviate these issues, we propose a new framework called Refill that leverages a predicted semantic segmentation mask as guidance for image completion.
Refill performs image completion in three steps: (1) predicting the class of the missing instance, (2) generating a semantic segmentation mask of the missing region, and (3) completing the masked image using the segmentation guidance. Specifically, Refill predicts the class of the missing instance based on the context of the image, which is determined by mining the inter-instance co-occurrence with a transformer network. Then, Refill generates the segmentation masks of the missing instance and of the background area of the missing region individually, using a conditional GAN and a transformer-body reconstruction network. Finally, taking the generated segmentation mask as input, our framework generates a context-friendly instance and its background, which fill in the masked image to produce a realistic natural image. The proposed context-aware, segmentation-guided image completion framework enables Refill to handle missing regions with arbitrary shapes (such as scribbles), unlike HVITA (Qiu et al., 2020), which is best suited for rectangular missing regions. Note that Refill avoids the need for a refinement network, on which HVITA heavily relies.

Figure 1: From the first column: input image with a missing region; results of state-of-the-art image completion approaches MAT (Li et al., 2022) and HVITA (Qiu et al., 2020); our result (Refill); and the target image. We compute CLIPScore around the generated part using the context query "A person riding a horse on the beach" (scores: 0.6416, 0.7620, 0.5913). As our approach generates a horse to complete the image rather than filling the region with background textures, our result achieves the best CLIPScore among the compared models.

To evaluate and compare our model against existing methods, we first employ an off-the-shelf image captioning network, OFA (Wang et al., 2022), to produce a caption for each missing region of the masked images. We use the produced caption as the context query, which represents the context of the missing region. To measure how much the context of the image changes after completion, we adopt two evaluation metrics: (1) CLIPScore (Hessel et al., 2021), which employs the CLIP visual and textual encoders to determine whether the generated image region is well aligned with the context query, and (2) Visual Grounding Accuracy (VGA), which uses a pretrained visual grounding model (Wang et al., 2022) to determine whether the context query can successfully ground the generated image region. We also evaluate our method using conventional image quality assessment metrics, including FID (Heusel et al., 2017) and LPIPS (Zhang et al., 2018).
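To make the first metric concrete: CLIPScore compares the CLIP embedding of the generated region with that of the context query and, following Hessel et al. (2021), rescales the cosine similarity by 2.5 while clipping negatives to zero. The sketch below assumes the embeddings have already been produced by the CLIP encoders; the function name and the use of NumPy are our illustration, not part of the proposed method.

```python
import numpy as np

def clip_score(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """CLIPScore between an image-region embedding and a text embedding.

    Both inputs are 1-D CLIP feature vectors (e.g., 512-D); in practice
    they come from CLIP's visual and textual encoders.
    """
    v = image_emb / np.linalg.norm(image_emb)
    t = text_emb / np.linalg.norm(text_emb)
    cosine = float(np.dot(v, t))
    # Hessel et al. (2021): rescale by 2.5 and clip negative similarities to 0.
    return 2.5 * max(cosine, 0.0)

# Toy embeddings: identical directions give the maximum score of 2.5,
# while orthogonal directions give 0.
v = np.array([1.0, 0.0, 0.0])
print(clip_score(v, v))                          # 2.5
print(clip_score(v, np.array([0.0, 1.0, 0.0])))  # 0.0
```

A higher score thus indicates that the generated region better matches the context query, which is how Figure 1's per-result numbers should be read.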
On the COCO-panoptic (Lin et al., 2014) and Visual Genome (Krishna et al., 2017) datasets, Refill shows visual quality (FID = 7.284/5.849) comparable to that of state-of-the-art image completion approaches such as MAT (Li et al., 2022), while its Visual Grounding Accuracy and CLIPScore are 12.472%/14.107% and 0.027/0.029 better than HVITA's, respectively. These results demonstrate that our approach can complete the missing regions of masked images in a context-friendly manner to yield high-quality images.

Our contributions are summarized as follows:

• We propose a novel framework called Refill which completes the missing region of masked images in a context-friendly manner, preserving the original context by leveraging a segmentation mask to encourage visual consistency between the generated and unmasked areas without relying on a refinement network.

• We present a novel combination of two transformer-based modules that facilitates our context-aware image completion pipeline. The missing-instance inference transformer effectively predicts the class of the missing instance, and the transformer-body background segmentation completion network recovers better segmentation masks, especially in the presence of large missing regions.

• We propose to adopt CLIPScore and VGA to evaluate the context consistency between the original image and the completed image.

• Refill produces new visual instances in missing regions that are visually consistent with the unmasked areas. Refill also outperforms the baselines on the CLIPScore and VGA metrics and achieves FID comparable to the state-of-the-art approaches.
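The VGA metric adopted above counts a completion as correct when the box that the grounding model predicts for the context query overlaps the generated region sufficiently. A minimal sketch of that overlap test follows, assuming boxes in (x1, y1, x2, y2) format and an IoU threshold of 0.5; both conventions are our assumptions for illustration and may differ from the exact criterion used in the experiments.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(pred_boxes, region_boxes, thr=0.5):
    """Fraction of queries whose grounded box hits the generated region."""
    hits = sum(iou(p, r) >= thr for p, r in zip(pred_boxes, region_boxes))
    return hits / len(pred_boxes)

# Toy example: one exact hit and one clear miss -> accuracy 0.5.
preds   = [(0, 0, 10, 10), (50, 50, 60, 60)]
regions = [(0, 0, 10, 10), (0, 0, 10, 10)]
print(grounding_accuracy(preds, regions))  # 0.5
```

Under this reading, a texture-filling baseline that erases the instance gives the grounding model nothing to localize, so its predicted boxes miss the generated region and its VGA drops.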

