RBPGAN: RECURRENT BACK-PROJECTION GENERATIVE ADVERSARIAL NETWORK FOR VIDEO SUPER RESOLUTION

Abstract

In this paper, we propose a new video super resolution algorithm that generates videos which are temporally coherent, spatially detailed, and faithful to human perception. To achieve this, we developed a new generative adversarial network, RBPGAN, composed of two main components: a generator network that surpasses other models in producing very high-quality frames, and a discriminator that outperforms others in terms of temporal consistency. The generator uses a reduced recurrent back-projection network that takes a set of neighboring frames and a target frame, applies Single Image Super Resolution (SISR) to each frame, and applies Multiple Image Super Resolution (MISR) through an encoder-decoder back-projection approach to combine them and produce a ×4-resolution version of the target frame. The spatio-temporal discriminator operates on triplets of frames and penalizes the generator to steer it toward the desired results. Our contribution is a model that outperforms earlier work in perceptual similarity and natural flow of frames while maintaining temporal coherence and high-quality spatial details. The algorithm was tested on different datasets to eliminate bias.

1. INTRODUCTION

Video Super Resolution (VSR) is the process of generating high-resolution (HR) videos from low-resolution (LR) videos. Videos are among the most common types of media shared in everyday life, from entertainment such as movies to security camera footage, and VSR has grown in importance accordingly. There is a need to modernize old videos or enhance their quality for purposes such as identifying faces in security footage, enhancing satellite-captured video, and watching old movies at today's quality. Related to and older than VSR is Image Super Resolution (ISR), the process of generating a single high-resolution image from a single low-resolution image. Since a video is a sequence of frames (images), VSR can be seen as ISR applied to each frame. This view is useful because many ISR techniques can be adapted to VSR with slight modifications; however, there are major differences between the two. The main difference is the temporal dimension of videos, which does not exist in images: the relationship between a frame and the other frames of a video is what makes VSR more complex than ISR (Haris et al., 2019). In this research, we explore various VSR methods, which are mainly clustered into two groups: methods with alignment and methods without alignment (Liu et al., 2022). We compare the different methods across different datasets and discuss the results. Out of the methods we studied, we chose two models as the base models for our research. We further explore these base models, experiment with them, and discuss possible enhancements. This paper aims to minimize the trade-off between temporal coherence, natural-to-the-eye perception, and quality in VSR.
To achieve this, we propose a Generative Adversarial Network (GAN) that combines concepts from different state-of-the-art models.
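The generator's data flow described above (an SISR path on the target frame plus an MISR path fusing the target with each neighbor, combined to yield a ×4 output) can be sketched at a high level as follows. This is a minimal shape-level illustration only, not the actual RBPGAN implementation: the `sisr` and `misr` functions below are hypothetical placeholders using naive nearest-neighbor upsampling and averaging, whereas the real model uses learned recurrent back-projection networks.

```python
import numpy as np

SCALE = 4  # x4 super resolution, as targeted by the model


def sisr(frame):
    # Placeholder single-image path: nearest-neighbor x4 upsampling.
    # The real model uses a learned SISR sub-network here.
    return frame.repeat(SCALE, axis=0).repeat(SCALE, axis=1)


def misr(target, neighbor):
    # Placeholder multi-image path: fuse the target with one neighbor,
    # then upsample. The real model uses an encoder-decoder
    # back-projection approach instead of simple averaging.
    fused = 0.5 * (target + neighbor)
    return fused.repeat(SCALE, axis=0).repeat(SCALE, axis=1)


def generator(target, neighbors):
    # One feature map from the SISR path plus one per neighbor from the
    # MISR path; the maps are stacked and reduced to a single HR frame
    # (here by a placeholder mean, standing in for learned reconstruction).
    maps = [sisr(target)] + [misr(target, n) for n in neighbors]
    return np.stack(maps, axis=0).mean(axis=0)


lr_target = np.random.rand(16, 16)          # low-resolution target frame
lr_neighbors = [np.random.rand(16, 16) for _ in range(2)]  # neighboring frames
sr_frame = generator(lr_target, lr_neighbors)
print(sr_frame.shape)  # (64, 64): x4 in each spatial dimension
```

In training, a spatio-temporal discriminator would receive triplets of consecutive generated (or ground-truth) frames and penalize the generator for temporal inconsistencies, but that adversarial loop is omitted from this sketch.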

