TOWARDS ONLINE REAL-TIME MEMORY-BASED VIDEO INPAINTING TRANSFORMERS

Abstract

Video inpainting tasks have seen significant improvements in the past years with the rise of deep neural networks and, in particular, vision transformers. Although these models show promising reconstruction quality and temporal consistency, they remain unsuitable for live videos, one of the last steps toward making them fully practical. The main limitations are that these state-of-the-art models inpaint using the whole video (offline processing) and run at an insufficient frame rate. We propose a framework that adapts existing inpainting transformers to these constraints by memorizing and refining redundant computations while maintaining a decent inpainting quality. Applying this framework to some of the most recent inpainting models, we obtain strong online results with a consistent throughput above 20 frames per second. Code and pretrained models will be made available upon acceptance.

1. INTRODUCTION

Video inpainting is the task of filling missing regions of a video with plausible and coherent content. It can be seen as an extension of the better-known image inpainting problem, but the extra temporal dimension brings new challenges. Video inpainting can serve various applications, such as object removal (Ebdelli et al., 2015), video restoration (Lee et al., 2019) or video completion (Chang et al., 2019b). To be convincing, a video inpainting must be spatially coherent, meaning that the content filled in each frame fits with the rest of the image. It must also be temporally coherent, meaning that the video plays smoothly and without artifacts. Models leveraging deep learning (Kim et al., 2019; Oh et al., 2019; Chang et al., 2019a) have made significant progress recently, especially on the temporal consistency that more traditional methods lacked (Wexler et al., 2007; Granados et al., 2012; Newson et al., 2014). Among them, transformers (Zeng et al., 2020; Liu et al., 2021a) have shown the best performance in terms of both quality and speed.

With more and more live content today (e.g. cultural and sporting events, social media streaming), online and real-time video inpainting is necessary to handle these new types of broadcast. Such techniques could also prove useful in the field of augmented perception. These models should be able to inpaint an ongoing video at a frame rate sufficient to be considered 'live'. While a few previous works investigated this (Herling & Broll, 2014; Kari et al., 2021), none of the current state-of-the-art approaches meet the criteria to be called either online or real-time, limiting the potential real-life use cases of this technology.

In this work, we propose a framework to adapt the most recent transformer-based video inpainting techniques to both online and real-time standards, with as little loss of quality as possible.

I. Online

We explore the natural modifications required to make any inpainting model work online. In doing so, we derive an online baseline that trades away some inpainting quality for causality. The main drawback of this approach is its frame rate, which remains too low for real time.
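The online constraint can be sketched as a causal sliding window: each incoming frame is inpainted using only itself and a buffer of recent past frames, never future ones. The following is a minimal illustration of this wrapping; the names (`OnlineInpainter`, `inpaint_window`) and the window size are our own illustrative assumptions, not the paper's actual API, and `inpaint_window` stands in for any offline inpainting transformer restricted to the given window.

```python
from collections import deque

def inpaint_window(frames, masks):
    """Stand-in for an offline inpainting model applied to a causal
    window; returns one (here: unmodified) frame per input frame."""
    return list(frames)

class OnlineInpainter:
    """Causal wrapper: inpaints each new frame from past frames only."""

    def __init__(self, window_size=5):
        # deque(maxlen=...) automatically evicts the oldest frame.
        self.frames = deque(maxlen=window_size)
        self.masks = deque(maxlen=window_size)

    def step(self, frame, mask):
        self.frames.append(frame)
        self.masks.append(mask)
        # The model only ever sees the causal window, never future frames.
        outputs = inpaint_window(self.frames, self.masks)
        return outputs[-1]  # inpainted version of the newest frame
```

Because the whole window is re-processed at every step, such a baseline is online but still repeats most of its computation from one frame to the next, which is what the memory mechanism below targets.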

II. Memory

We then add a memory that keeps the successive intermediate results of these transformers, reducing the number of computations needed for subsequent frames. With that, we increase the number of frames per second by a factor of 3, passing the real-time threshold at the cost of yet another quality drop.

III. Refined

Finally, we refine the memory-based framework to temper the loss of inpainting quality while maintaining a real-time throughput. To do that, two models run side by side and communicate as the live video goes on. The first inpaints frames in real time as they come, using as much previous knowledge as it can have. Simultaneously, the second model reinpaints frames that have already passed, with more time and more care. It then communicates its results to the first model, giving it valuable information to use.

We demonstrate the proposed techniques (Online, Memory, Refined) on three of the most recent transformer-based models and achieve online real-time operating points on the usual video inpainting tasks and datasets. The remainder of the paper is structured as follows. In Section 2, we give an overview of former and current research on video inpainting. Then, we detail our online video inpainting models in Section 3. Finally, we report all our results and discuss them in Section 4.
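The core of the memory idea is that, in a causal sliding window, most frames were already processed at earlier steps, so their expensive per-frame features (e.g. the encoder activations or attention keys/values of the transformer) can be cached instead of recomputed. The sketch below illustrates only this caching-and-eviction logic under our own simplifying assumptions; `encode` stands in for the expensive per-frame computation, and the class name and interface are hypothetical, not the paper's implementation.

```python
class FeatureMemory:
    """Caches per-frame features so each frame is encoded only once,
    evicting entries that fall out of the causal window."""

    def __init__(self, encode, window_size=5):
        self.encode = encode          # expensive per-frame feature extractor
        self.cache = {}               # frame index -> cached features
        self.window_size = window_size

    def features_for_window(self, frame_ids, frames):
        feats = []
        for fid, frame in zip(frame_ids, frames):
            if fid not in self.cache:         # compute only once per frame
                self.cache[fid] = self.encode(frame)
            feats.append(self.cache[fid])
        # Evict features of frames that left the causal window.
        oldest_kept = max(frame_ids) - self.window_size + 1
        for fid in list(self.cache):
            if fid < oldest_kept:
                del self.cache[fid]
        return feats
```

With such a memory, each step only pays the cost of encoding the single new frame plus the (cheaper) fusion over cached features, which is consistent with the roughly threefold speed-up reported above.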

2. RELATED WORK

2.1. IMAGE INPAINTING

The first works on image inpainting were proposed decades ago and relied on known textures to fill the missing content (Bertalmio et al., 2000; Efros & Leung, 1999; Efros & Freeman, 2001). These textures were sampled from other areas of the image using patches matched against the corrupted area via a similarity score. Variations on this approach were then proposed (Hays & Efros, 2007), including PatchMatch (Barnes et al., 2009), which uses an approximate patch matching to obtain a tool fast enough for commercial use. With the development of more complex models such as convolutional neural networks (CNN) (Krizhevsky et al., 2012), recurrent neural networks (RNN) (He et al., 2016) and generative adversarial networks (GAN) (Goodfellow et al., 2014), recent image inpainting works have focused on deep neural networks (e.g. encoders) trained with adversarial losses, with great results (Iizuka et al., 2017; Pathak et al., 2016; Yu et al., 2018; Nazeri et al., 2019).

2.2. TRADITIONAL VIDEO INPAINTING

As with image inpainting, the first models proposed were handcrafted, relying on more traditional image manipulation techniques. Most adopted a similar patch-based approach in which the video was cut into spatio-temporal patches, and a similarity score was computed between patches to fill the missing content with information from similar ones (Wexler et al., 2004; 2007; Newson et al., 2014; Patwardhan et al., 2005). To deal with videos exhibiting more complex motion, refinements were later proposed, for example introducing optical flows to guide the patch-based inpainting (Huang et al., 2016). Other techniques not using patches were also proposed, relying, for instance, on image transformations to align the frames together (Granados et al., 2012).

2.3. DEEP VIDEO INPAINTING

Deep learning brought significant improvement to the quality of video inpainting, in the same way it did for image inpainting. Deep neural networks have been leveraged in various ways to inpaint a video (Ouyang et al., 2021; Lee et al., 2019; Ke et al., 2021), with most approaches belonging to three main categories, as described by Zou et al. (2021). One approach employs an encoder-decoder model with a mix of 2D and 3D convolutions (Wang et al., 2019). VINet (Kim et al., 2019) was one of the first deep models capable of competing with state-of-the-art traditional techniques (Huang et al., 2016). Improvements were later made by adding gated convolutions (Chang et al., 2019a) and designing a video-specific GAN loss called T-PatchGAN (Chang et al., 2019b), used in many works since then (Zeng et al., 2020; Liu et al., 2021b). This strategy is, however, impeded by the heavy computations that 3D convolutions entail. Other models perform video inpainting by utilizing the optical flows of the videos (Xu et al., 2019; Gao et al., 2020). Forward and backward flows are first computed for the known part of the video

