TOWARDS ONLINE REAL-TIME MEMORY-BASED VIDEO INPAINTING TRANSFORMERS

Abstract

Video inpainting has seen significant improvements in recent years with the rise of deep neural networks and, in particular, vision transformers. Although these models show promising reconstruction quality and temporal consistency, they remain unsuitable for live videos, one of the last steps toward making them fully convincing and usable. The main limitations are that these state-of-the-art models inpaint using the whole video (offline processing) and run at an insufficient frame rate. We propose a framework that adapts existing inpainting transformers to these constraints by memorizing and refining redundant computations while maintaining decent inpainting quality. Applying this framework to some of the most recent inpainting models, we obtain strong online results with a consistent throughput above 20 frames per second. Code and pretrained models will be made available upon acceptance.

1. INTRODUCTION

Video inpainting is the task of filling missing regions in a video with plausible and coherent content. It can be seen as an extension of the better-known image inpainting task, with the extra temporal dimension bringing new challenges. Video inpainting serves various applications, such as object removal (Ebdelli et al., 2015), video restoration (Lee et al., 2019), or video completion (Chang et al., 2019b). To be convincing, a video inpainting must be spatially coherent, that is, the content filled in each frame must fit with the rest of the image. It must also be temporally coherent, meaning that the video is smooth and free of artifacts when played.

Models leveraging deep learning (Kim et al., 2019; Oh et al., 2019; Chang et al., 2019a) have made significant progress recently, especially on the temporal consistency that was lacking in more traditional methods (Wexler et al., 2007; Granados et al., 2012; Newson et al., 2014). Among them, transformers (Zeng et al., 2020; Liu et al., 2021a) show the best performance in terms of both quality and speed.

With ever more live content today (e.g., cultural and sporting events, social media streaming), online and real-time video inpainting is necessary to handle these new types of broadcast. Such techniques could also prove useful in the field of augmented perception. These models should be able to inpaint an ongoing video at a frame rate sufficient to qualify as 'live'. While a few previous works have investigated this (Herling & Broll, 2014; Kari et al., 2021), none of the current state-of-the-art approaches meets the criteria to be called either online or real-time, limiting the potential real-life use cases of this technology.

In this work, we propose a framework to adapt the most recent transformer-based video inpainting techniques to both online and real-time standards, with as little loss of quality as possible.

I. Online

We explore the natural modifications required to make any inpainting model work online. In doing so, we derive an online baseline that trades off some inpainting quality. The main drawback of this approach is its frame rate, which remains too low for real time.
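The core online constraint is that the model may only attend to frames already seen. A minimal sketch of this idea, assuming a hypothetical frame-batch inpainting callable and an illustrative window size (neither is specified in the paper), replaces whole-video processing with a sliding window of past frames:

```python
from collections import deque

import numpy as np


class OnlineInpainter:
    """Hypothetical wrapper turning an offline inpainting model online.

    Instead of attending over the whole video, the model only sees a
    sliding window of the most recent frames. The model API and window
    size are placeholders for illustration, not the paper's actual ones.
    """

    def __init__(self, model, window_size=10):
        self.model = model          # callable: (frames, masks) -> frames
        self.window = deque(maxlen=window_size)

    def step(self, frame, mask):
        """Inpaint the newest frame using only past and current frames."""
        self.window.append((frame, mask))
        frames = np.stack([f for f, _ in self.window])
        masks = np.stack([m for _, m in self.window])
        # Run the model on the window; keep only the newest completed frame.
        completed = self.model(frames, masks)
        return completed[-1]
```

Because the window has a fixed length, per-frame latency is bounded, but every step still recomputes the full window, which motivates the memory mechanism below.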

II. Memory

We then add a memory that stores the successive intermediate results of these transformers, reducing the computations needed for subsequent frames. This increases the number of frames per second by a factor of 3, passing the real-time threshold at the cost of a further drop in quality.
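One way to realize such a memory, sketched here under the assumption that the cached intermediates are attention keys and values (the class name, capacity, and API are illustrative, not the paper's), is to compute each past frame's features once and reuse them when attending from new frames:

```python
import numpy as np


class FeatureMemory:
    """Hypothetical cache of per-frame transformer features.

    Past frames' keys/values are computed once and stored, so each new
    step only computes the newest frame's features and attends over the
    cache instead of re-encoding the whole window.
    """

    def __init__(self, capacity=10):
        self.capacity = capacity
        self.keys = []    # cached key tensors, one (N, d) array per frame
        self.values = []  # cached value tensors, matching shapes

    def add(self, key, value):
        self.keys.append(key)
        self.values.append(value)
        if len(self.keys) > self.capacity:
            # Drop the oldest entry to bound memory and compute.
            self.keys.pop(0)
            self.values.pop(0)

    def attend(self, query):
        """Single-head scaled dot-product attention over cached frames."""
        k = np.concatenate(self.keys, axis=0)      # (T*N, d)
        v = np.concatenate(self.values, axis=0)
        scores = query @ k.T / np.sqrt(query.shape[-1])
        # Numerically stable softmax over the cached tokens.
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ v
```

With this scheme, the per-step cost of attending to the past scales with one frame's encoding plus a matrix product over the cache, rather than re-running the transformer on every frame in the window.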

