RBPGAN: RECURRENT BACK-PROJECTION GENERATIVE ADVERSARIAL NETWORK FOR VIDEO SUPER RESOLUTION

Abstract

In this paper, we propose a new video super resolution algorithm that aims to generate videos which are temporally coherent, spatially detailed, and consistent with human perception. To achieve this, we developed a new generative adversarial network, RBPGAN, composed of two main components: a generator network that excels at producing high-quality frames, and a discriminator that outperforms alternatives in terms of temporal consistency. The generator uses a reduced recurrent back-projection network that takes a target frame together with a set of neighboring frames, applies SISR (Single Image Super Resolution) to each frame, and applies MISR (Multiple Image Super Resolution) through an encoder-decoder back-projection approach to combine them and produce a ×4-resolution version of the target frame. The spatio-temporal discriminator operates on triplets of frames and penalizes the generator to steer it toward the desired results. Our contribution is a model that outperforms earlier work in terms of perceptual similarity and natural flow of frames, while maintaining temporal coherence and high-quality spatial details. The algorithm was tested on different datasets to eliminate bias.

1. INTRODUCTION

Video Super Resolution (VSR) is the process of generating High Resolution (HR) videos from Low Resolution (LR) videos. Videos are among the most common types of media shared in everyday life, from entertainment such as movies to security camera footage, so VSR has become correspondingly important. There is a need to modernize old videos or enhance their quality for purposes such as identifying faces in security footage, enhancing satellite-captured video, and watching old movies at today's quality. Related to, and older than, VSR is Image Super Resolution (ISR), the process of generating a single high-resolution image from a single low-resolution image. Since a video is a sequence of frames (images), VSR can be seen as ISR applied to each frame of the video. Many ISR techniques can be modified slightly to apply to VSR; however, there are major differences between the two problems. The main difference is the temporal dimension of video, which does not exist in images. The relationship between a frame and the other frames of a video is the reason why VSR is more complex than ISR (Haris et al., 2019). In this research, various VSR methods are explored. The methods are mainly clustered into two groups: methods with alignment and methods without alignment (Liu et al., 2022). We compare the different methods across different datasets and discuss the results. Out of the methods we studied, we chose two models as the base models for our work. We further explore these base models, experiment with them, and discuss possible enhancements. This paper aims to minimize the trade-off between temporal coherence, natural-to-the-eye perception, and quality in VSR.
To achieve this, we propose a Generative Adversarial Network (GAN) that combines concepts from different state-of-the-art models in a way that achieves our goal. Our methodology, experiments, and results are presented in turn. Finally, we conclude the paper and propose directions for future work.

2. RELATED WORK

Based on our review of the literature, the deep learning methods that target the video super resolution problem can be divided into two main categories: methods with alignment and methods without alignment. Alignment means that the input LR video frames are aligned before being fed into the model. Methods with alignment can be divided into two sub-categories: methods with Motion Estimation and Motion Compensation (MEMC), and methods with Deformable Convolution (DC). Methods without alignment can be divided into four sub-categories: 2D convolution, 3D convolution, RNN-based, and non-local-based. In this section, the best-performing methods in each category are discussed.

2.1. METHODS WITH ALIGNMENT

2.1.1. MOTION ESTIMATION AND MOTION COMPENSATION (MEMC)

First, the Temporally Coherent Generative Adversarial Network (TecoGAN) (Chu et al., 2020) proposes a temporal adversarial learning method for a recurrent training approach that can address video super resolution while maintaining the temporal coherence and consistency of the video, without losing spatial detail and without producing artifacts or features that arbitrarily appear and disappear over time. TecoGAN is evaluated on several datasets, including the widely used Vid4, and is compared to the state-of-the-art models ENet, FRVSR, DUF, RBPN, and EDVR. It is able to generate improved and realistic details in both down-sampled and captured images. However, one limitation of the model is that it can produce temporally coherent yet sub-optimal details in certain cases, such as under-resolved faces and text.
Second, the Recurrent Back-Projection Network (RBPN) (Haris et al., 2019) consists of a feature extraction module, a projection module, and a reconstruction module. A recurrent encoder-decoder module integrates spatial and temporal context from consecutive frames. Rather than explicitly aligning frames, the architecture represents the estimated inter-frame motion with respect to the target frame. The method is inspired by back-projection for MISR, which iteratively computes residual images as the reconstruction error between a target image and a set of its corresponding images; these residuals are projected back onto the target image to improve its resolution. RBPN thus integrates SISR and MISR in a unified VSR framework: SISR iteratively extracts feature maps representing the details of the target frame, while MISR produces sets of feature maps from the other frames. The authors report extensive VSR experiments on datasets with different characteristics to evaluate the model's strengths and weaknesses in detail, and the model achieves significant results in terms of the quality of the produced videos.
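The back-projection loop described above can be illustrated with a minimal NumPy sketch. This is not RBPN's learned architecture: the average-pooling `downscale` and nearest-neighbour `upscale` are illustrative stand-ins we assume for the degradation and up-projection operators, and all function names are our own.

```python
import numpy as np

def downscale(img, scale):
    # Average-pooling stand-in for the true degradation model.
    h, w = img.shape
    return img.reshape(h // scale, scale, w // scale, scale).mean(axis=(1, 3))

def upscale(img, scale):
    # Nearest-neighbour stand-in for the learned up-projection.
    return np.repeat(np.repeat(img, scale, axis=0), scale, axis=1)

def iterative_back_projection(lr_frames, scale=4, n_iters=10, step=0.5):
    """Refine an HR estimate by projecting LR reconstruction errors back.

    lr_frames: LR observations of (roughly) the same scene; the first one
    is the target frame. Each pass computes the residual between every LR
    observation and the simulated LR projection of the current HR estimate,
    then adds the upscaled residual back to the HR estimate.
    """
    hr = upscale(lr_frames[0], scale)  # initial HR guess from target frame
    for _ in range(n_iters):
        for lr in lr_frames:
            residual = lr - downscale(hr, scale)       # reconstruction error in LR space
            hr = hr + step * upscale(residual, scale)  # project error back to HR
    return hr
```

With these operators the loop drives the simulated LR projection of the HR estimate toward the observed LR frames, which is the core idea RBPN generalizes with learned projection modules.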

2.1.2. DEFORMABLE CONVOLUTION METHODS (DC)

The Enhanced Deformable Video Restoration (EDVR) framework (Wang et al., 2019) performs several video super resolution and restoration tasks. The architecture of EDVR is composed of two main modules: Pyramid, Cascading, and Deformable convolutions (PCD) for alignment, and Temporal and Spatial Attention (TSA) for fusion. Deformable convolution is built on the idea of taking the regular sampling locations of a convolution kernel and learning additional offsets to those locations without extra supervision. The model achieves very strong results; however, its size and number of parameters are very high.
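The offset mechanism can be sketched for a single output pixel of a 3×3 kernel. This is a simplified NumPy illustration of the sampling idea, not EDVR's PCD module: in practice the offsets are predicted by a separate convolution and the sampling runs over all channels and positions at once (e.g. via `torchvision.ops.deform_conv2d`).

```python
import numpy as np

def bilinear_sample(img, y, x):
    """Bilinearly interpolate img (H, W) at fractional coordinates (y, x)."""
    h, w = img.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y0c, y1c = np.clip([y0, y0 + 1], 0, h - 1)
    x0c, x1c = np.clip([x0, x0 + 1], 0, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * img[y0c, x0c] + wx * img[y0c, x1c]
    bot = (1 - wx) * img[y1c, x0c] + wx * img[y1c, x1c]
    return (1 - wy) * top + wy * bot

def deformable_conv_pixel(img, weights, offsets, cy, cx):
    """One output pixel of a 3x3 deformable convolution at (cy, cx).

    offsets has shape (3, 3, 2): a learned (dy, dx) shift per kernel tap,
    added to the regular grid position before sampling. With all-zero
    offsets this reduces to an ordinary 3x3 convolution.
    """
    out = 0.0
    for ky in range(3):
        for kx in range(3):
            dy, dx = offsets[ky, kx]
            # Regular sampling grid position, displaced by the learned offset.
            y = cy + (ky - 1) + dy
            x = cx + (kx - 1) + dx
            out += weights[ky, kx] * bilinear_sample(img, y, x)
    return out
```

Because the offsets are fractional, bilinear interpolation keeps the sampling differentiable, which is what lets the network learn the offsets by backpropagation.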

2.2. METHODS WITHOUT ALIGNMENT

2.2.1. 2D AND 3D CONVOLUTION

2D-convolution-based VSR mainly uses the classic convolution operation to extract information from the frames of a video in the spatial dimension only and to increase the resolution accordingly (Lu-

