INSTANCE-AWARE IMAGE COMPLETION

Abstract

Image completion aims to fill in the missing region of a masked image with plausible content. However, existing image completion methods tend to fill in the missing region with surrounding texture instead of hallucinating a visual instance that suits the context of the scene. In this work, we propose a novel image completion model, dubbed Refill, that hallucinates a missing instance that harmonizes well with, and thus preserves, the original context. Refill first adopts a transformer architecture that considers the types and locations of the visible instances as well as the location of the missing region. Then, Refill completes the foreground and background semantic segmentation masks within the missing region, providing pixel-level semantic and structural guidance to generate missing content with seamless boundaries. Finally, we condition the image synthesis blocks of Refill on the completed segmentation mask to generate photo-realistic content for the missing region. Experimental results show the superiority of Refill over state-of-the-art image completion approaches on various natural images.

1. INTRODUCTION

Image completion is the task of restoring masked regions in an image, which requires an understanding of the unmasked instances and the various relationships among them. Researchers have developed image completion models for practical applications such as image editing (Jo & Park, 2019; Ling et al., 2021), restoration (Wan et al., 2020; Liang et al., 2021), and object removal (Shetty et al., 2018). Most previous models, however, focus on filling in the missing region realistically without considering the instance that needs to be restored. For example, we observe that even a cutting-edge image inpainting model (Li et al., 2022) tends to complete the missing region with surrounding textures rather than attempting to restore the lost instance; this limits the use of image completion models in real-world applications. Removing a focal instance from a scene can substantially change its context. For example, removing the horse in the image of Figure 1 changes the local context around the missing region from "a person riding a horse on the beach" to "a boy walking on the beach". HVITA (Qiu et al., 2020) is the only work that tackles such substantial context change arising from the complete removal of a visual instance from the scene. However, HVITA has three major limitations: (1) it mainly targets rectangular masks and thus generalizes poorly to other mask shapes, (2) its completed images exhibit abrupt changes along the boundaries between the generated and original regions, and (3) it relies heavily on a refinement network to produce realistic images. To alleviate these issues, we propose a new framework, called Refill, that leverages a predicted semantic segmentation mask as guidance for image completion.
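At a high level, such a segmentation-guided framework chains three stages: class prediction, segmentation completion, and conditioned synthesis. The sketch below is a toy illustration of that data flow only; the function names, the co-occurrence lookup table, and the stubbed stage internals are hypothetical placeholders, not the actual transformer, conditional GAN, or synthesis networks.

```python
# Toy sketch of a three-stage, segmentation-guided completion pipeline.
# Images and segmentation maps are plain 2D lists; all learned modules
# are replaced by trivial stand-ins to show only the stage interfaces.

def predict_missing_class(visible_instances):
    """Stage 1: infer the class of the missing instance from context.
    A real model would mine inter-instance co-occurrence with a
    transformer; here we use a toy lookup table (an assumption)."""
    co_occurrence = {"person": "horse", "horse": "person"}
    for cls, _center in visible_instances:
        if cls in co_occurrence:
            return co_occurrence[cls]
    return "background"

def complete_segmentation(masked_seg, missing_class, mask_bbox):
    """Stage 2: complete the segmentation map inside the hole.
    Stubbed: write the predicted class over the masked box; a real
    model would generate instance and background masks separately."""
    x0, y0, x1, y1 = mask_bbox
    seg = [row[:] for row in masked_seg]
    for y in range(y0, y1):
        for x in range(x0, x1):
            seg[y][x] = missing_class
    return seg

def synthesize_image(masked_image, completed_seg, mask_bbox):
    """Stage 3: synthesize pixels conditioned on the completed
    segmentation. Stubbed: keep visible pixels and copy the class
    label into the hole instead of running a conditioned generator."""
    x0, y0, x1, y1 = mask_bbox
    out = [row[:] for row in masked_image]
    for y in range(y0, y1):
        for x in range(x0, x1):
            out[y][x] = completed_seg[y][x]
    return out

def refill(masked_image, masked_seg, visible_instances, mask_bbox):
    """Wire the three stages together end to end."""
    cls = predict_missing_class(visible_instances)
    seg = complete_segmentation(masked_seg, cls, mask_bbox)
    return synthesize_image(masked_image, seg, mask_bbox), cls
```

The key design point the sketch preserves is that stage 3 never sees raw context directly; it is conditioned entirely on the completed segmentation from stage 2, which is what allows the generated content to respect instance boundaries.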
Refill performs image completion in three steps: 1) predicting the class of the missing instance, 2) generating a semantic segmentation mask of the missing region, and 3) completing the masked image under the segmentation guidance. Specifically, Refill predicts the class of the missing instance from the context of the image, which is determined by mining inter-instance co-occurrence with a transformer network. Then, Refill generates the segmentation masks of the missing instance and of the background area of the missing region individually, using a reconstruction network that combines a conditional GAN with a transformer body. Finally, taking the generated segmentation mask as input, our framework synthesizes a context-friendly instance and its background, which fill in the masked image to produce a realistic natural image. The proposed context-aware, segmentation-guided image completion framework enables Refill to handle missing regions with arbitrary shapes (such

